Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms

Handle URI:
http://hdl.handle.net/10754/598457
Title:
Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms
Authors:
Quintin, Jean-Noel; Hasanov, Khalid; Lastovetsky, Alexey
Abstract:
Matrix multiplication is a very important computation kernel, both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon's algorithm, which dates back to 1969, was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However, this algorithm requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes this shortcoming of Cannon's algorithm, as it can also be used on a nonsquare number of processors. Since then, the number of processors in HPC platforms has increased by two orders of magnitude, making the contribution of communication to the overall execution time more significant. Therefore, state-of-the-art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene/P demonstrate a reduction of communication cost of up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores. © 2013 IEEE.
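Illustrative sketch (not taken from the paper): the abstract notes that SUMMA works on a nonsquare number of processors. The single-process simulation below shows the general SUMMA pattern of panel-by-panel updates on a virtual process grid. The function name summa_simulated, its parameters, and the even block distribution are assumptions made for illustration only. No real message passing is performed, and the two-level hierarchy of HSUMMA is not modeled.

```python
# Single-process simulation of a SUMMA-style multiplication on a virtual
# p_rows x p_cols process grid. Names and block layout are hypothetical.

def summa_simulated(A, B, p_rows, p_cols, panel=1):
    """Return C = A * B, organized as SUMMA-style panel updates."""
    n, inner, m = len(A), len(B), len(B[0])
    if n % p_rows or m % p_cols:
        raise ValueError("this sketch assumes evenly divisible blocks")
    rb, cb = n // p_rows, m // p_cols  # shape of the C block per process
    C = [[0.0] * m for _ in range(n)]

    # Walk over the inner dimension one panel at a time.
    for k0 in range(0, inner, panel):
        k1 = min(k0 + panel, inner)
        for pi in range(p_rows):
            # Stand-in for the broadcast of the A panel along process row pi.
            a_panel = [A[i][k0:k1] for i in range(pi * rb, (pi + 1) * rb)]
            for pj in range(p_cols):
                # Stand-in for the broadcast of the B panel along process column pj.
                b_panel = [B[k][pj * cb:(pj + 1) * cb] for k in range(k0, k1)]
                # Local update of the C block owned by virtual process (pi, pj).
                for i in range(rb):
                    row = C[pi * rb + i]
                    for kk in range(k1 - k0):
                        a = a_panel[i][kk]
                        for j in range(cb):
                            row[pj * cb + j] += a * b_panel[kk][j]
    return C


if __name__ == "__main__":
    # A 2 x 3 grid (6 virtual processes) is a nonsquare processor count.
    A = [[float(i + j) for j in range(6)] for i in range(4)]
    B = [[float(i * j + 1) for j in range(6)] for i in range(6)]
    C = summa_simulated(A, B, p_rows=2, p_cols=3, panel=2)
    print(len(C), "x", len(C[0]), "result computed")
```

In a distributed setting each of the two stand-in steps would be a collective broadcast, which is the communication cost the abstract says HSUMMA aims to reduce.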
Citation:
Quintin J-N, Hasanov K, Lastovetsky A (2013) Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms. 2013 42nd International Conference on Parallel Processing. Available: http://dx.doi.org/10.1109/ICPP.2013.89.
Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Journal:
2013 42nd International Conference on Parallel Processing
Issue Date:
Oct-2013
DOI:
10.1109/ICPP.2013.89
Type:
Conference Paper
Sponsors:
The research in this paper was supported by IRCSET (Irish Research Council for Science, Engineering and Technology) and IBM, grant numbers EPSG/2011/188 and EP-SPD/2011/207. Some of the experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr). Another part of the experiments in this research were carried out using the resources of the Supercomputing Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia.
Appears in Collections:
Publications Acknowledging KAUST Support

Full metadata record

DC Field: Value [Language]
dc.contributor.author: Quintin, Jean-Noel [en]
dc.contributor.author: Hasanov, Khalid [en]
dc.contributor.author: Lastovetsky, Alexey [en]
dc.date.accessioned: 2016-02-25T13:21:03Z [en]
dc.date.available: 2016-02-25T13:21:03Z [en]
dc.date.issued: 2013-10 [en]
dc.identifier.citation: Quintin J-N, Hasanov K, Lastovetsky A (2013) Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms. 2013 42nd International Conference on Parallel Processing. Available: http://dx.doi.org/10.1109/ICPP.2013.89. [en]
dc.identifier.doi: 10.1109/ICPP.2013.89 [en]
dc.identifier.uri: http://hdl.handle.net/10754/598457 [en]
dc.description.abstract: Matrix multiplication is a very important computation kernel, both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon's algorithm, which dates back to 1969, was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However, this algorithm requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes this shortcoming of Cannon's algorithm, as it can also be used on a nonsquare number of processors. Since then, the number of processors in HPC platforms has increased by two orders of magnitude, making the contribution of communication to the overall execution time more significant. Therefore, state-of-the-art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene/P demonstrate a reduction of communication cost of up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores. © 2013 IEEE. [en]
dc.description.sponsorship: The research in this paper was supported by IRCSET (Irish Research Council for Science, Engineering and Technology) and IBM, grant numbers EPSG/2011/188 and EP-SPD/2011/207. Some of the experiments presented in this paper were carried out using the Grid’5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr). Another part of the experiments in this research were carried out using the resources of the Supercomputing Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia. [en]
dc.publisher: Institute of Electrical and Electronics Engineers (IEEE) [en]
dc.title: Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms [en]
dc.type: Conference Paper [en]
dc.identifier.journal: 2013 42nd International Conference on Parallel Processing [en]
dc.contributor.institution: Extreme Computing R and D, Bull, France [en]
dc.contributor.institution: University College Dublin, Dublin, Ireland [en]