Communication Reducing Approaches and Shared-Memory Optimizations for the Hierarchical Fast Multipole Method on Distributed and Many-core Systems
Advisors: Keyes, David E.
Permanent link to this record: http://hdl.handle.net/10754/630221
Abstract: We present algorithms and implementations that overcome obstacles in the migration of the Fast Multipole Method (FMM), one of the most important algorithms in computational science and engineering, to exascale computing. Emerging architectural approaches to exascale computing are all characterized by data movement rates that are slow relative to the demands of aggregate floating-point capability, resulting in performance that is bandwidth limited. Practical parallel applications of FMM are impeded in their scaling by the irregularity of domains and the dominance of collective tree communication, which is known not to scale well. We introduce novel ideas that improve partitioning of the N-body problem with boundary distribution through a sampling-based mechanism that hybridizes two well-known partitioning techniques, Hashed Octree (HOT) and Orthogonal Recursive Bisection (ORB). To reduce communication cost, we employ two methodologies. First, we directly utilize features available in parallel runtime systems to enable asynchronous computing and overlap it with communication. Second, we present Hierarchical Sparse Data Exchange (HSDX), a new all-to-all algorithm that inherently relieves communication by relaying sparse data in a few steps of neighbor exchanges. HSDX exhibits superior scalability and outperforms the default MPI alltoall and other relevant implementations from the literature. We test this algorithm alongside others on a Cray XC40 tightly coupled with the Aries network and on the Intel Many Integrated Core Architecture (MIC), represented by Intel Knights Corner (KNC) and Intel Knights Landing (KNL), as modern shared-memory CPU environments. Tests include comparisons of thoroughly tuned handwritten vectorization versus auto-vectorization of the FMM Particle-to-Particle (P2P) and Multipole-to-Local (M2L) kernels. Scalability of task-based parallelism is assessed with FMM's tree traversal kernel using different threading libraries.
The MIC tests show large performance gains from adopting the prescribed techniques, which become unavoidable as computing moves towards many-core parallelism.
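For readers unfamiliar with the two partitioning techniques named in the abstract, the following is a minimal Python sketch of each in its textbook form. It is not the sampling-based hybrid scheme developed in the thesis. The function names `morton_key` and `orb_partition`, the leading placeholder bit in the key, and the median split on the widest axis are assumptions chosen for illustration, and the ORB sketch assumes at least as many points as parts.

```python
def morton_key(ix, iy, iz, levels):
    """Interleave the bits of three integer cell indices into one octree key.

    Hypothetical helper: a leading 1 bit is prepended so that keys at
    different tree levels never collide, as is common in hashed octrees.
    """
    key = 1  # placeholder bit that encodes the level
    for bit in range(levels - 1, -1, -1):
        octant = (((ix >> bit) & 1) << 2) | (((iy >> bit) & 1) << 1) | ((iz >> bit) & 1)
        key = (key << 3) | octant
    return key


def orb_partition(points, num_parts):
    """Recursively bisect a list of coordinate tuples into num_parts groups.

    Hypothetical helper: each step sorts along the axis with the largest
    extent and splits so that group sizes stay proportional to the number
    of parts assigned to each side.
    """
    if num_parts == 1:
        return [points]
    dims = len(points[0])
    axis = max(
        range(dims),
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
    )
    ordered = sorted(points, key=lambda p: p[axis])
    left_parts = num_parts // 2
    split = len(ordered) * left_parts // num_parts
    return (orb_partition(ordered[:split], left_parts)
            + orb_partition(ordered[split:], num_parts - left_parts))


if __name__ == "__main__":
    print(bin(morton_key(2, 0, 1, 2)))
    pts = [(float(i), 0.0, 0.0) for i in range(8)]
    print([len(part) for part in orb_partition(pts, 4)])
```

Sorting particles by such a key places spatially nearby cells close together in a one-dimensional ordering, whereas ORB balances particle counts by cutting space directly; a hybrid approach would aim to combine properties of both.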