Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms
Online Publication Date2014-03-04
Print Publication Date2015-11
Permanent link to this recordhttp://hdl.handle.net/10754/598455
MetadataShow full item record
Abstract© 2014, Springer Science+Business Media New York. Many state-of-the-art parallel algorithms, which are widely used in scientific applications executed on high-end computing systems, were designed in the twentieth century with relatively small-scale parallelism in mind. Indeed, while in 1990s a system with few hundred cores was considered a powerful supercomputer, modern top supercomputers have millions of cores. In this paper, we present a hierarchical approach to optimization of message-passing parallel algorithms for execution on large-scale distributed-memory systems. The idea is to reduce the communication cost by introducing hierarchy and hence more parallelism in the communication scheme. We apply this approach to SUMMA, the state-of-the-art parallel algorithm for matrix–matrix multiplication, and demonstrate both theoretically and experimentally that the modified Hierarchical SUMMA significantly improves the communication cost and the overall performance on large-scale platforms.
CitationHasanov K, Quintin J-N, Lastovetsky A (2014) Hierarchical approach to optimization of parallel matrix multiplication on large-scale platforms. The Journal of Supercomputing. Available: http://dx.doi.org/10.1007/s11227-014-1133-x.
SponsorsThe research in this paper was supported by IRCSET (Irish Research Council for Science, Engineering and Technology) and IBM, grant numbers EPSG/2011/188 and EPSPD/2011/207. Some of the experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr) Another part of the experiments was carried out using the resources of the Supercomputing Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia. The authors would like to thank Ashley DeFlumere for her useful comments and corrections.
JournalThe Journal of Supercomputing