High Performance Polar Decomposition on Distributed Memory Systems

Handle URI:
http://hdl.handle.net/10754/622144
Title:
High Performance Polar Decomposition on Distributed Memory Systems
Authors:
Sukkari, Dalal E.; Ltaief, Hatem (0000-0002-6897-1095); Keyes, David E. (0000-0002-4052-7224)
Abstract:
The polar decomposition of a dense matrix is an important operation in linear algebra. It can be calculated directly through the singular value decomposition (SVD) or iteratively using the QR-based dynamically weighted Halley algorithm (QDWH). The former is difficult to parallelize because memory-bound operations dominate the bidiagonal reduction. We investigate the latter approach, which performs more floating-point operations but exposes more parallelism and therefore runs closer to the theoretical peak performance of the system, thanks to more compute-bound matrix operations. Profiling results show the performance scalability of QDWH for calculating the polar decomposition using around 9200 MPI processes on well- and ill-conditioned matrices of size 100K×100K. We then study the performance impact of the QDWH-based polar decomposition as a pre-processing step toward calculating the SVD itself. The new distributed-memory implementation of the QDWH-SVD solver achieves up to a five-fold speedup over current state-of-the-art vendor SVD implementations. © Springer International Publishing Switzerland 2016.
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Extreme Computing Research Center
Citation:
Sukkari D, Ltaief H, Keyes D (2016) High Performance Polar Decomposition on Distributed Memory Systems. Lecture Notes in Computer Science: 605–616. Available: http://dx.doi.org/10.1007/978-3-319-43659-3_44.
Publisher:
Springer Nature
Journal:
Euro-Par 2016: Parallel Processing
Conference/Event name:
22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016
Issue Date:
8-Aug-2016
DOI:
10.1007/978-3-319-43659-3_44
Type:
Conference Paper
ISSN:
0302-9743; 1611-3349
Sponsors:
For computer time, this research used the resources from the Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland.
Additional Links:
http://link.springer.com/chapter/10.1007%2F978-3-319-43659-3_44
Appears in Collections:
Conference Papers; Extreme Computing Research Center; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
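
Illustrative sketch (not part of the repository record): the abstract names two routes to the polar decomposition A = U H, one through the SVD and one through the QDWH iteration. The single-node NumPy example below shows both for a small square, nonsingular matrix. It follows the QDWH recurrence as it is commonly stated in the literature and is not the authors' distributed-memory implementation. The function names, the Frobenius-norm scaling, and the explicit inverse used for the initial lower bound are assumptions chosen for brevity.

```python
import numpy as np


def polar_svd(A):
    """Polar factors from the SVD: A = (U V^T)(V S V^T)."""
    U, s, Vt = np.linalg.svd(A)
    return U @ Vt, (Vt.T * s) @ Vt


def qdwh_polar(A, tol=1e-12, max_iter=50):
    """Sketch of the QR-based dynamically weighted Halley iteration.

    Assumes A is square and nonsingular (illustration only).
    """
    n = A.shape[1]
    # Scale so that the largest singular value of X is at most 1
    # (the Frobenius norm is an upper bound on the 2-norm).
    alpha = np.linalg.norm(A, "fro")
    X = A / alpha
    # Lower bound on the smallest singular value of X. An explicit inverse
    # is used here only for simplicity; it is an assumption of this sketch.
    l = 1.0 / np.linalg.norm(np.linalg.inv(X), "fro")
    I = np.eye(n)
    for _ in range(max_iter):
        l = min(float(l), 1.0)  # guard against rounding above 1
        # Dynamically chosen weights (a, b, c); they tend to (3, 1, 3),
        # the plain Halley weights, as l approaches 1.
        d = (4.0 * (1.0 - l**2) / l**4) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + d))
        )
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # QR factorization of the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:n], Q[n:]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        # Update the lower bound on the smallest singular value.
        l = l * (a + b * l**2) / (1.0 + c * l**2)
        converged = np.linalg.norm(X_new - X, "fro") < tol
        X = X_new
        if converged:
            break
    H = X.T @ A
    return X, (H + H.T) / 2.0  # symmetrize the Hermitian factor


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
U_qdwh, H_qdwh = qdwh_polar(A)
U_ref, H_ref = polar_svd(A)
```

For a nonsingular matrix the polar factors are unique, so the two routes should agree to within rounding error. The iteration spends its time in QR factorizations and matrix products, which is the compute-bound character the abstract refers to.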

Full metadata record

DC Field | Value | Language
dc.contributor.author | Sukkari, Dalal E. | en
dc.contributor.author | Ltaief, Hatem | en
dc.contributor.author | Keyes, David E. | en
dc.date.accessioned | 2017-01-02T08:10:21Z | -
dc.date.available | 2017-01-02T08:10:21Z | -
dc.date.issued | 2016-08-08 | en
dc.identifier.citation | Sukkari D, Ltaief H, Keyes D (2016) High Performance Polar Decomposition on Distributed Memory Systems. Lecture Notes in Computer Science: 605–616. Available: http://dx.doi.org/10.1007/978-3-319-43659-3_44. | en
dc.identifier.issn | 0302-9743 | en
dc.identifier.issn | 1611-3349 | en
dc.identifier.doi | 10.1007/978-3-319-43659-3_44 | en
dc.identifier.uri | http://hdl.handle.net/10754/622144 | -
dc.description.abstract | The polar decomposition of a dense matrix is an important operation in linear algebra. It can be calculated directly through the singular value decomposition (SVD) or iteratively using the QR-based dynamically weighted Halley algorithm (QDWH). The former is difficult to parallelize because memory-bound operations dominate the bidiagonal reduction. We investigate the latter approach, which performs more floating-point operations but exposes more parallelism and therefore runs closer to the theoretical peak performance of the system, thanks to more compute-bound matrix operations. Profiling results show the performance scalability of QDWH for calculating the polar decomposition using around 9200 MPI processes on well- and ill-conditioned matrices of size 100K×100K. We then study the performance impact of the QDWH-based polar decomposition as a pre-processing step toward calculating the SVD itself. The new distributed-memory implementation of the QDWH-SVD solver achieves up to a five-fold speedup over current state-of-the-art vendor SVD implementations. © Springer International Publishing Switzerland 2016. | en
dc.description.sponsorship | For computer time, this research used the resources from the Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland. | en
dc.publisher | Springer Nature | en
dc.relation.url | http://link.springer.com/chapter/10.1007%2F978-3-319-43659-3_44 | en
dc.title | High Performance Polar Decomposition on Distributed Memory Systems | en
dc.type | Conference Paper | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Extreme Computing Research Center | en
dc.identifier.journal | Euro-Par 2016: Parallel Processing | en
dc.conference.date | 2016-08-24 to 2016-08-26 | en
dc.conference.name | 22nd International Conference on Parallel and Distributed Computing, Euro-Par 2016 | en
dc.conference.location | Grenoble, FRA | en
kaust.author | Sukkari, Dalal E. | en
kaust.author | Ltaief, Hatem | en
kaust.author | Keyes, David E. | en