Show simple item record

dc.contributor.author	AbdulJabbar, Mustafa Abdulmajeed
dc.contributor.author	Al Farhan, Mohammed
dc.contributor.author	Al-Harthi, Noha A.
dc.contributor.author	Chen, Rui
dc.contributor.author	Yokota, Rio
dc.contributor.author	Bagci, Hakan
dc.contributor.author	Keyes, David E.
dc.date.accessioned	2018-04-16T11:27:42Z
dc.date.available	2018-04-16T11:27:42Z
dc.date.issued	2018-03-27
dc.identifier.uri	http://hdl.handle.net/10754/627504
dc.description.abstract	Algorithmic and architecture-oriented optimizations are essential for achieving performance worthy of anticipated energy-austere exascale systems. In this paper, we present an extreme-scale FMM-accelerated boundary integral equation solver for wave scattering, which uses FMM as the matrix-vector multiplication inside the GMRES iterative method. Our FMM Helmholtz kernels treat nontrivial singular and near-field integration points. We implement highly optimized kernels for both shared and distributed memory, targeting emerging Intel extreme-performance HPC architectures. We extract the potential thread- and data-level parallelism of the key Helmholtz kernels of FMM. Our application code is well optimized to exploit the AVX-512 SIMD units of the Intel Skylake and Knights Landing architectures. We provide different performance models for tuning the task-based tree-traversal implementation of FMM, and develop optimal architecture-specific and algorithm-aware partitioning, load-balancing, and communication-reducing mechanisms to scale up to 6,144 compute nodes of a Cray XC40 with 196,608 hardware cores. With shared-memory optimizations, we achieve roughly 77% of the peak single-precision floating-point performance of a 56-core Skylake processor, and on average 60% of the peak single-precision floating-point performance of a 72-core KNL. These numbers represent nearly 5.4x and 10x speedups on Skylake and KNL, respectively, compared to the baseline scalar code. With distributed-memory optimizations, on the other hand, we report near-optimal efficiency in the weak scalability study with respect to both the logarithmic communication complexity and the theoretical scaling complexity of FMM. In addition, we exhibit up to 85% efficiency in strong scaling. We compute in excess of 2 billion DoF at the full scale of the Cray XC40 supercomputer.
dc.publisher	arXiv
dc.relation.url	http://arxiv.org/abs/1803.09948v1
dc.relation.url	http://arxiv.org/pdf/1803.09948v1
dc.rights	Archived with thanks to arXiv
dc.title	Extreme Scale FMM-Accelerated Boundary Integral Equation Solver for Wave Scattering
dc.type	Preprint
dc.contributor.department	Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
dc.contributor.department	Computer Science Program
dc.contributor.department	Electrical Engineering Program
dc.contributor.department	Applied Mathematics and Computational Science Program
dc.contributor.department	Extreme Computing Research Center
dc.eprint.version	Pre-print
dc.contributor.institution	Tokyo Institute of Technology, Tokyo, Japan.
dc.identifier.arxivid	arXiv:1803.09948
kaust.person	AbdulJabbar, Mustafa Abdulmajeed
kaust.person	Al Farhan, Mohammed
kaust.person	Al-Harthi, Noha A.
kaust.person	Chen, Rui
kaust.person	Bagci, Hakan
kaust.person	Keyes, David E.
refterms.dateFOA	2018-06-14T04:20:17Z
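
The abstract above describes the solver structure: the FMM supplies the matrix-vector product inside GMRES. Below is a minimal, hypothetical Python sketch of that matrix-free arrangement using SciPy's LinearOperator and gmres. The geometry, wavenumber, quadrature weight, and the dense O(N^2) Helmholtz sum standing in for the FMM evaluation are all illustrative assumptions, not the paper's implementation.

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

# Toy setup: random "surface" points stand in for the discretized scatterer.
rng = np.random.default_rng(0)
n = 500                      # number of unknowns (toy size)
k = 2.0 * np.pi              # wavenumber (assumed value)
pts = rng.random((n, 3))     # placeholder collocation points

# Dense 3-D Helmholtz kernel exp(ik|r-r'|)/(4*pi*|r-r'|), self term excluded.
# In the paper's solver this dense product is replaced by the FMM evaluation.
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(d, 1.0)                    # dummy value to avoid 0/0 on the diagonal
G = np.exp(1j * k * d) / (4.0 * np.pi * d)
np.fill_diagonal(G, 0.0)                    # drop the singular self-interaction term
w = 1.0 / n                                 # crude quadrature weight (placeholder)

def matvec(x):
    # Identity term plus weighted kernel sum, loosely mimicking a
    # second-kind integral equation; purely for illustration.
    return x + w * (G @ x)

A = LinearOperator((n, n), matvec=matvec, dtype=np.complex128)
b = rng.random(n) + 1j * rng.random(n)      # placeholder excitation vector

x, info = gmres(A, b, restart=50, maxiter=200)
print("converged" if info == 0 else f"gmres info = {info}")
print("relative residual:", np.linalg.norm(matvec(x) - b) / np.linalg.norm(b))

The point of the sketch is that GMRES only needs the action of the system matrix on a vector, so the dense product can be swapped for an FMM evaluation (and distributed across nodes, as the abstract reports) without changing the Krylov loop itself.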


Files in this item

Name:        1803.09948v1.pdf
Size:        3.445 MB
Format:      PDF
Description: Preprint
