Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture
Type
Conference Paper
KAUST Department
Applied Mathematics and Computational Science Program
Computer Science Program
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Extreme Computing Research Center
Date
2017-08-01
Online Publication Date
2017-08-01
Print Publication Date
2017
Permanent link to this record
http://hdl.handle.net/10754/625714
Abstract
Manycore optimizations are essential for achieving performance worthy of anticipated exascale systems. Utilization of manycore chips is inevitable to attain the desired floating point performance of these energy-austere systems. In this work, we revisit ExaFMM, the open source Fast Multipole Method (FMM) library, in light of highly tuned shared-memory parallelization and detailed performance analysis on the new highly parallel Intel manycore architecture, Knights Landing (KNL). We assess scalability and performance gain using task-based parallelism of the FMM tree traversal. We also provide an in-depth analysis of the most computationally intensive part of the traversal kernel (i.e., the particle-to-particle (P2P) kernel), by comparing its performance across KNL and Broadwell architectures. We quantify different configurations that exploit the on-chip 512-bit vector units within different task-based threading paradigms. MPI communication-reducing and NUMA-aware approaches for the FMM’s global tree data exchange are examined with different cluster modes of KNL. By applying several algorithm- and architecture-aware optimizations for FMM, we show that the N-Body kernel on 256 threads of KNL achieves on average 2.8× speedup compared to the non-vectorized version, whereas on 56 threads of Broadwell, it achieves on average 2.9× speedup. In addition, the tree traversal kernel on KNL scales monotonically up to 256 threads with task-based programming models. The MPI-based communication-reducing algorithms show expected improvements of the data locality across the KNL on-chip network.
Citation
Abduljabbar M, Al Farhan M, Yokota R, Keyes D (2017) Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture. Euro-Par 2017: Parallel Processing: 553–564. Available: http://dx.doi.org/10.1007/978-3-319-64203-1_40.
Publisher
Springer Nature
Conference/Event name
23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017
Additional Links
https://link.springer.com/chapter/10.1007%2F978-3-319-64203-1_40
DOI
10.1007/978-3-319-64203-1_40