Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture
KAUST Department: Applied Mathematics and Computational Science Program
Computer Science Program
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Extreme Computing Research Center
Online Publication Date: 2017-08-01
Print Publication Date: 2017
Permanent link to this record: http://hdl.handle.net/10754/625714
Abstract: Manycore optimizations are essential for achieving performance worthy of anticipated exascale systems. Utilization of manycore chips is inevitable to attain the desired floating point performance of these energy-austere systems. In this work, we revisit ExaFMM, the open source Fast Multipole Method (FMM) library, in light of highly tuned shared-memory parallelization and detailed performance analysis on the new highly parallel Intel manycore architecture, Knights Landing (KNL). We assess scalability and performance gain using task-based parallelism of the FMM tree traversal. We also provide an in-depth analysis of the most computationally intensive part of the traversal kernel (i.e., the particle-to-particle (P2P) kernel), by comparing its performance across KNL and Broadwell architectures. We quantify different configurations that exploit the on-chip 512-bit vector units within different task-based threading paradigms. MPI communication-reducing and NUMA-aware approaches for the FMM’s global tree data exchange are examined with different cluster modes of KNL. By applying several algorithm- and architecture-aware optimizations for FMM, we show that the N-Body kernel on 256 threads of KNL achieves on average 2.8× speedup compared to the non-vectorized version, whereas on 56 threads of Broadwell, it achieves on average 2.9× speedup. In addition, the tree traversal kernel on KNL scales monotonically up to 256 threads with task-based programming models. The MPI-based communication-reducing algorithms show expected improvements of the data locality across the KNL on-chip network.
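For readers unfamiliar with the P2P kernel named in the abstract, the following is a minimal scalar sketch of a direct particle-to-particle evaluation for the Laplace potential. It is an illustration only and is not taken from ExaFMM: the `Body` struct, the `p2p` function name, and the choice of the 1/r kernel are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle layout for illustration; not ExaFMM's data structure.
struct Body {
    double x, y, z;  // position
    double q;        // source strength (charge or mass)
};

// Direct P2P evaluation of the Laplace potential (assumed kernel):
//   phi[i] += sum over j of q_j / |x_i - x_j|
// Pairs at zero distance are skipped, so a particle does not act on itself.
void p2p(const std::vector<Body>& targets,
         const std::vector<Body>& sources,
         std::vector<double>& phi) {
    assert(phi.size() == targets.size());
    for (std::size_t i = 0; i < targets.size(); ++i) {
        double acc = 0.0;
        // Inner loop: a sum reduction over all sources for one target.
        for (std::size_t j = 0; j < sources.size(); ++j) {
            const double dx = targets[i].x - sources[j].x;
            const double dy = targets[i].y - sources[j].y;
            const double dz = targets[i].z - sources[j].z;
            const double r2 = dx * dx + dy * dy + dz * dz;
            if (r2 > 0.0) {
                acc += sources[j].q / std::sqrt(r2);
            }
        }
        phi[i] += acc;  // accumulate into the caller's potential array
    }
}
```

The cost is proportional to the number of targets times the number of sources, which is why this kernel tends to dominate the traversal time. The inner reduction is the part that wide vector units can accelerate, typically through compiler vectorization or hand-written intrinsics; this sketch makes no claim about which approach the paper uses.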
Citation: Abduljabbar M, Al Farhan M, Yokota R, Keyes D (2017) Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture. Euro-Par 2017: Parallel Processing: 553–564. Available: http://dx.doi.org/10.1007/978-3-319-64203-1_40.
Conference/Event name: 23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017