Performance Impact of Rank-Reordering on Advanced Polar Decomposition Algorithms

Type
Technical Report

Authors
Esposito, Aniello
Keyes, David E.
Ltaief, Hatem
Sukkari, Dalal

KAUST Department
ECRC

Date
2018

Abstract
We demonstrate the importance of both MPI rank reordering and the choice of processor grid topology in the context of advanced dense linear algebra (DLA) applications for distributed-memory systems. In particular, we focus on the advanced polar decomposition (PD) algorithm based on the QR-based Dynamically Weighted Halley method (QDWH). The QDWH algorithm may be used as the first computational step toward solving symmetric eigenvalue problems and computing the singular value decomposition. Sukkari et al. (ACM TOMS, 2017) have shown that QDWH may benefit from rectangular instead of square processor grid topologies, which directly impact the performance of the underlying ScaLAPACK algorithms. In this work, we experiment with an extensive set of combinations of grid topologies and rank reorderings for different matrix sizes and numbers of nodes. We use QDWH as a proxy for advanced compute-bound linear algebra operations, since it is rich in dense linear solvers and factorizations. A performance improvement of up to 54% can be observed for QDWH on 800 nodes of a Cray XC system thanks to an optimal combination of grid topology and rank reordering, especially in the strong scaling mode of operation, where communication overheads may become dominant. We perform a thorough application profiling to analyze the impact of reordering and grid topologies on the various linear algebra components of the QDWH algorithm. It turns out that point-to-point communications may be considerably reduced by a judicious choice of grid topology, combined with a proper rank reordering set through the features of the cray-mpich library.
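To make the interaction between grid shape and rank placement concrete, here is a minimal Python sketch under simplifying assumptions; it is not the report's methodology and not how cray-mpich implements reordering. It assumes ranks are laid out row-major on a p × q process grid (as with the BLACS default row ordering), that every node hosts the same number of ranks, and two placements: "smp" (consecutive ranks fill one node before moving to the next) and "round_robin" (ranks are dealt across nodes in turn). The helper names `node_of` and `extra_nodes_per_scope` are hypothetical. As a rough proxy for off-node traffic in row-wise and column-wise broadcasts, it counts how many extra nodes each process row and each process column spans.

```python
def node_of(rank, nranks, ranks_per_node, method):
    """Node hosting `rank` under a simplified placement model (assumption)."""
    nnodes = nranks // ranks_per_node
    if method == "smp":  # consecutive ranks fill a node first
        return rank // ranks_per_node
    if method == "round_robin":  # ranks dealt one per node in turn
        return rank % nnodes
    raise ValueError(f"unknown placement method: {method}")


def extra_nodes_per_scope(p, q, ranks_per_node, method):
    """Return (row_cost, col_cost) for a p x q grid with row-major ranks.

    row_cost sums, over all process rows, the number of distinct nodes the
    row spans minus one; col_cost does the same for process columns. A cost
    of zero means that scope's collectives could stay on a single node.
    """
    nranks = p * q
    if nranks % ranks_per_node:
        raise ValueError("grid size must be a multiple of ranks_per_node")

    def node(i, j):
        # Row-major mapping: grid position (i, j) holds rank i * q + j.
        return node_of(i * q + j, nranks, ranks_per_node, method)

    row_cost = sum(len({node(i, j) for j in range(q)}) - 1 for i in range(p))
    col_cost = sum(len({node(i, j) for i in range(p)}) - 1 for j in range(q))
    return row_cost, col_cost


if __name__ == "__main__":
    # Compare two grid shapes and two placements with 8 ranks per node.
    for p, q in [(4, 8), (8, 4)]:
        for method in ("smp", "round_robin"):
            print(p, q, method, extra_nodes_per_scope(p, q, 8, method))
```

For a 4 × 8 grid with 8 ranks per node, the "smp" placement keeps every process row on one node (row cost 0) but spreads each column over 4 nodes (column cost 24), while "round_robin" reverses the picture (12, 0). Which combination is better in practice depends on how much traffic each factorization sends along rows versus columns, which is why profiling the actual communication pattern matters.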
