A QDWH-Based SVD Software Framework on Distributed-Memory Manycore Systems

Handle URI:
http://hdl.handle.net/10754/626212
Title:
A QDWH-Based SVD Software Framework on Distributed-Memory Manycore Systems
Authors:
Sukkari, Dalal; Ltaief, Hatem (0000-0002-6897-1095); Esposito, Aniello; Keyes, David E. (0000-0002-4052-7224)
Abstract:
This paper presents a high-performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al. (Nakatsukasa et al. 2010; Nakatsukasa and Higham 2013), the SVD solver relies on the polar decomposition using the QR Dynamically-Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying on the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on the performance by looking at the communication and computation profiling trade-offs. We report performance results against state-of-the-art existing QDWH software implementations (e.g., Elemental) and their SVD extensions on large-scale distributed-memory manycore systems based on commodity Intel x86 Haswell processors and Knights Landing (KNL) architecture. The QDWH-SVD framework achieves up to 3-fold and 8-fold speedups on the Haswell-based and KNL-based platforms, respectively, against ScaLAPACK PDGESVD and turns out to be a competitive alternative for well-conditioned and ill-conditioned matrices. Finally, we propose a performance model based on these empirical results.
Our QDWH-based polar decomposition and its SVD extension are freely available at https://github.com/ecrc/qdwh.git and https://github.com/ecrc/ksvd.git, respectively, and have been integrated into the Cray Scientific numerical library LibSci v17.11.1.
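
To make the pipeline described in the abstract concrete, the following is a minimal single-node NumPy sketch of a QDWH-based SVD: the QR-based dynamically weighted Halley iteration computes the orthonormal polar factor U_p of A = U_p H, a symmetric eigendecomposition H = V diag(s) V^T follows, and the SVD is recovered as A = (U_p V) diag(s) V^T. It is an illustration only and not the code in the qdwh or ksvd repositories. The function names `qdwh_polar` and `qdwh_svd`, the stopping tolerance, and the use of exact norms (instead of cheap norm and condition estimates) are assumptions made for this sketch, which also assumes a full-rank real matrix with at least as many rows as columns.

```python
import numpy as np

def qdwh_polar(A, tol=1e-12, max_iter=20):
    """Orthonormal polar factor of a full-rank m x n matrix (m >= n) via QDWH."""
    m, n = A.shape
    alpha = np.linalg.norm(A, 2)        # assumption: exact 2-norm instead of an estimate
    X = A / alpha                       # scale so that all singular values are <= 1
    l = 0.9 * np.linalg.norm(X, -2)     # assumption: safe lower bound on sigma_min(X)
    I = np.eye(n)
    for _ in range(max_iter):
        # Dynamically weighted Halley parameters a, b, c computed from l.
        l2 = l * l
        d = np.cbrt(4.0 * (1.0 - l2) / (l2 * l2))
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * np.sqrt(1.0 + d)))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # QR-based, inverse-free update: factor the stacked matrix [sqrt(c) X; I].
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:m], Q[m:]
        X_new = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.T)
        # Update the lower bound on the smallest singular value, clamped to 1.
        l = min(1.0, l * (a + b * l2) / (1.0 + c * l2))
        converged = np.linalg.norm(X_new - X, "fro") <= tol
        X = X_new
        if converged:
            break
    return X

def qdwh_svd(A):
    """SVD of A from its polar decomposition A = U_p H and H = V diag(s) V^T."""
    Up = qdwh_polar(A)
    H = Up.T @ A
    H = 0.5 * (H + H.T)                 # enforce symmetry against rounding errors
    s, V = np.linalg.eigh(H)            # eigenvalues are returned in ascending order
    order = np.argsort(s)[::-1]         # reorder to the descending SVD convention
    s, V = s[order], V[:, order]
    return Up @ V, s, V.T

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))
U, s, Vt = qdwh_svd(A)
residual = np.linalg.norm(U @ np.diag(s) @ Vt - A)
```

The sketch uses only QR factorizations, matrix products, and one symmetric eigensolve, which reflects why the approach maps well onto Level 3 BLAS kernels despite its higher operation count. A distributed implementation would replace these NumPy calls with their ScaLAPACK counterparts on a two-dimensional block cyclic layout.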
KAUST Department:
ECRC; CEMSE Division
Issue Date:
2017
Type:
Technical Report
Appears in Collections:
Technical Reports

Full metadata record

DC Field | Value | Language
--- | --- | ---
dc.contributor.author | Sukkari, Dalal | en
dc.contributor.author | Ltaief, Hatem | en
dc.contributor.author | Esposito, Aniello | en
dc.contributor.author | Keyes, David E. | en
dc.date.accessioned | 2017-11-28T05:46:53Z | -
dc.date.available | 2017-11-28T05:46:53Z | -
dc.date.issued | 2017 | -
dc.identifier.uri | http://hdl.handle.net/10754/626212 | -
dc.description.abstract | This paper presents a high-performance software framework for computing a dense SVD on distributed-memory manycore systems. Originally introduced by Nakatsukasa et al. (Nakatsukasa et al. 2010; Nakatsukasa and Higham 2013), the SVD solver relies on the polar decomposition using the QR Dynamically-Weighted Halley algorithm (QDWH). Although the QDWH-based SVD algorithm performs a significant amount of extra floating-point operations compared to the traditional SVD with the one-stage bidiagonal reduction, the inherent high level of concurrency associated with Level 3 BLAS compute-bound kernels ultimately compensates for the arithmetic complexity overhead. Using the ScaLAPACK two-dimensional block cyclic data distribution with a rectangular processor topology, the resulting QDWH-SVD further reduces excessive communications during the panel factorization, while increasing the degree of parallelism during the update of the trailing submatrix, as opposed to relying on the default square processor grid. After detailing the algorithmic complexity and the memory footprint of the algorithm, we conduct a thorough performance analysis and study the impact of the grid topology on the performance by looking at the communication and computation profiling trade-offs. We report performance results against state-of-the-art existing QDWH software implementations (e.g., Elemental) and their SVD extensions on large-scale distributed-memory manycore systems based on commodity Intel x86 Haswell processors and Knights Landing (KNL) architecture. The QDWH-SVD framework achieves up to 3-fold and 8-fold speedups on the Haswell-based and KNL-based platforms, respectively, against ScaLAPACK PDGESVD and turns out to be a competitive alternative for well-conditioned and ill-conditioned matrices. Finally, we propose a performance model based on these empirical results. Our QDWH-based polar decomposition and its SVD extension are freely available at https://github.com/ecrc/qdwh.git and https://github.com/ecrc/ksvd.git, respectively, and have been integrated into the Cray Scientific numerical library LibSci v17.11.1. | en
dc.subject | Dense SVD solver | en
dc.subject | Polar decomposition | en
dc.subject | QDWH | en
dc.subject | Performance analysis | en
dc.subject | Distributed-memory manycore systems | en
dc.title | A QDWH-Based SVD Software Framework on Distributed-Memory Manycore Systems | en
dc.type | Technical Report | en
dc.contributor.department | ECRC | en
dc.contributor.department | CEMSE Division | en