Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture

Handle URI:
http://hdl.handle.net/10754/625714
Title:
Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture
Authors:
AbdulJabbar, Mustafa Abdulmajeed; Al Farhan, Mohammed; Yokota, Rio; Keyes, David E. (ORCID: 0000-0002-4052-7224)
Abstract:
Manycore optimizations are essential for achieving performance worthy of anticipated exascale systems. Utilization of manycore chips is inevitable to attain the desired floating point performance of these energy-austere systems. In this work, we revisit ExaFMM, the open source Fast Multipole Method (FMM) library, in light of highly tuned shared-memory parallelization and detailed performance analysis on the new highly parallel Intel manycore architecture, Knights Landing (KNL). We assess scalability and performance gain using task-based parallelism of the FMM tree traversal. We also provide an in-depth analysis of the most computationally intensive part of the traversal kernel (i.e., the particle-to-particle (P2P) kernel), by comparing its performance across KNL and Broadwell architectures. We quantify different configurations that exploit the on-chip 512-bit vector units within different task-based threading paradigms. MPI communication-reducing and NUMA-aware approaches for the FMM’s global tree data exchange are examined with different cluster modes of KNL. By applying several algorithm- and architecture-aware optimizations for FMM, we show that the N-Body kernel on 256 threads of KNL achieves on average 2.8× speedup compared to the non-vectorized version, whereas on 56 threads of Broadwell, it achieves on average 2.9× speedup. In addition, the tree traversal kernel on KNL scales monotonically up to 256 threads with task-based programming models. The MPI-based communication-reducing algorithms show expected improvements of the data locality across the KNL on-chip network.
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science Program; Applied Mathematics and Computational Science Program; Extreme Computing Research Center
Citation:
Abduljabbar M, Al Farhan M, Yokota R, Keyes D (2017) Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture. Euro-Par 2017: Parallel Processing: 553–564. Available: http://dx.doi.org/10.1007/978-3-319-64203-1_40.
Publisher:
Springer International Publishing
Journal:
Euro-Par 2017: Parallel Processing
Conference/Event name:
23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017
Issue Date:
31-Jul-2017
DOI:
10.1007/978-3-319-64203-1_40
Type:
Conference Paper
ISSN:
0302-9743; 1611-3349
Additional Links:
https://link.springer.com/chapter/10.1007%2F978-3-319-64203-1_40
Appears in Collections:
Conference Papers; Applied Mathematics and Computational Science Program; Extreme Computing Research Center; Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | AbdulJabbar, Mustafa Abdulmajeed | en
dc.contributor.author | Al Farhan, Mohammed | en
dc.contributor.author | Yokota, Rio | en
dc.contributor.author | Keyes, David E. | en
dc.date.accessioned | 2017-10-03T12:49:35Z | -
dc.date.available | 2017-10-03T12:49:35Z | -
dc.date.issued | 2017-07-31 | en
dc.identifier.citation | Abduljabbar M, Al Farhan M, Yokota R, Keyes D (2017) Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture. Euro-Par 2017: Parallel Processing: 553–564. Available: http://dx.doi.org/10.1007/978-3-319-64203-1_40. | en
dc.identifier.issn | 0302-9743 | en
dc.identifier.issn | 1611-3349 | en
dc.identifier.doi | 10.1007/978-3-319-64203-1_40 | en
dc.identifier.uri | http://hdl.handle.net/10754/625714 | -
dc.description.abstract | Manycore optimizations are essential for achieving performance worthy of anticipated exascale systems. Utilization of manycore chips is inevitable to attain the desired floating point performance of these energy-austere systems. In this work, we revisit ExaFMM, the open source Fast Multipole Method (FMM) library, in light of highly tuned shared-memory parallelization and detailed performance analysis on the new highly parallel Intel manycore architecture, Knights Landing (KNL). We assess scalability and performance gain using task-based parallelism of the FMM tree traversal. We also provide an in-depth analysis of the most computationally intensive part of the traversal kernel (i.e., the particle-to-particle (P2P) kernel), by comparing its performance across KNL and Broadwell architectures. We quantify different configurations that exploit the on-chip 512-bit vector units within different task-based threading paradigms. MPI communication-reducing and NUMA-aware approaches for the FMM’s global tree data exchange are examined with different cluster modes of KNL. By applying several algorithm- and architecture-aware optimizations for FMM, we show that the N-Body kernel on 256 threads of KNL achieves on average 2.8× speedup compared to the non-vectorized version, whereas on 56 threads of Broadwell, it achieves on average 2.9× speedup. In addition, the tree traversal kernel on KNL scales monotonically up to 256 threads with task-based programming models. The MPI-based communication-reducing algorithms show expected improvements of the data locality across the KNL on-chip network. | en
dc.publisher | Springer International Publishing | en
dc.relation.url | https://link.springer.com/chapter/10.1007%2F978-3-319-64203-1_40 | en
dc.subject | AVX-512 | en
dc.subject | Fast multipole method | en
dc.subject | Intel knights landing | en
dc.title | Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture | en
dc.type | Conference Paper | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Computer Science Program | en
dc.contributor.department | Applied Mathematics and Computational Science Program | en
dc.contributor.department | Extreme Computing Research Center | en
dc.identifier.journal | Euro-Par 2017: Parallel Processing | en
dc.conference.date | 2017-08-28 to 2017-09-01 | en
dc.conference.name | 23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017 | en
dc.conference.location | Santiago de Compostela, ESP | en
dc.contributor.institution | Tokyo Institute of Technology, Tokyo, Japan | en
kaust.author | AbdulJabbar, Mustafa Abdulmajeed | en
kaust.author | Al Farhan, Mohammed | en
kaust.author | Keyes, David E. | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.