Type
Article
Authors
Zhang, Junchao
Brown, Jed
Balay, Satish
Faibussowitsch, Jacob
Knepley, Matthew
Marin, Oana
Mills, Richard Tran
Munson, Todd
Smith, Barry F.
Zampini, Stefano

KAUST Department
Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division
Extreme Computing Research Center
Date
2021-05-26
Preprint Posting Date
2021-02-25
Online Publication Date
2021
Print Publication Date
2022-04-01
Permanent link to this record
http://hdl.handle.net/10754/667821
Abstract
PetscSF, the communication component of the Portable, Extensible Toolkit for Scientific Computation (PETSc), is designed to provide PETSc's communication infrastructure suitable for exascale computers that utilize GPUs and other accelerators. PetscSF provides a simple application programming interface (API) for managing common communication patterns in scientific computations by using a star-forest graph representation. PetscSF supports several implementations based on MPI and NVSHMEM, whose selection is based on the characteristics of the application or the target architecture. An efficient and portable model for network and intra-node communication is essential for implementing large-scale applications. The Message Passing Interface, which has been the de facto standard for distributed memory systems, has developed into a large, complex API that does not yet provide high performance on the emerging heterogeneous CPU-GPU-based exascale systems. In this paper, we discuss the design of PetscSF and how it can overcome some difficulties of working directly with MPI on GPUs, and we demonstrate its performance, scalability, and novel features.
Citation
Zhang, J., Brown, J., Balay, S., Faibussowitsch, J., Knepley, M., Marin, O., … Zampini, S. (2021). The PetscSF Scalable Communication Layer. IEEE Transactions on Parallel and Distributed Systems, 1–1. doi:10.1109/tpds.2021.3084070
Sponsors
We thank Akhil Langer and Jim Dinan from the NVIDIA NVSHMEM team for their assistance. This work was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration, and by the U.S. Department of Energy under Contract DE-AC02-06CH11357 and Office of Science Awards DESC0016140 and DE-AC02-0000011838. This research used resources of the Oak Ridge Leadership Computing Facilities, a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
Publisher
IEEE
arXiv
2102.13018
Additional Links
https://ieeexplore.ieee.org/document/9442258/
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9442258
DOI
10.1109/TPDS.2021.3084070