Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures

Handle URI:
http://hdl.handle.net/10754/625885
Title:
Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures
Authors:
Sukkari, Dalal E.; Ltaief, Hatem (0000-0002-6897-1095); Faverge, Mathieu; Keyes, David E. (0000-0002-4052-7224)
Abstract:
This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations for the polar decomposition on latest shared-memory vendors' systems, while maintaining numerical accuracy.
KAUST Department:
ECRC, KAUST, Jeddah, Jeddah Saudi Arabia; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Citation:
Sukkari D, Ltaief H, Faverge M, Keyes D (2017) Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures. IEEE Transactions on Parallel and Distributed Systems: 1–1. Available: http://dx.doi.org/10.1109/TPDS.2017.2755655.
Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Journal:
IEEE Transactions on Parallel and Distributed Systems
Issue Date:
29-Sep-2017
DOI:
10.1109/TPDS.2017.2755655
Type:
Article
ISSN:
1045-9219
Sponsors:
The authors would like to thank Samuel Thibault from Inria for his support with StarPU, Jack Poulson from Google Inc. for his help in tuning Elemental and the vendors Cray/IBM/Intel/NVIDIA for their hardware donations and/or systems’ remote accesses in the context of the Cray Center of Excellence, the Intel Parallel Computing Center and the NVIDIA GPU Research Center awarded to the Extreme Computing Research Center at KAUST.
Additional Links:
http://ieeexplore.ieee.org/document/8053812/
Appears in Collections:
Articles; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Sukkari, Dalal E. | en
dc.contributor.author | Ltaief, Hatem | en
dc.contributor.author | Faverge, Mathieu | en
dc.contributor.author | Keyes, David E. | en
dc.date.accessioned | 2017-10-17T11:47:39Z | -
dc.date.available | 2017-10-17T11:47:39Z | -
dc.date.issued | 2017-09-29 | en
dc.identifier.citation | Sukkari D, Ltaief H, Faverge M, Keyes D (2017) Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures. IEEE Transactions on Parallel and Distributed Systems: 1–1. Available: http://dx.doi.org/10.1109/TPDS.2017.2755655. | en
dc.identifier.issn | 1045-9219 | en
dc.identifier.doi | 10.1109/TPDS.2017.2755655 | en
dc.identifier.uri | http://hdl.handle.net/10754/625885 | -
dc.description.abstract | This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations for the polar decomposition on latest shared-memory vendors' systems, while maintaining numerical accuracy. | en
dc.description.sponsorship | The authors would like to thank Samuel Thibault from Inria for his support with StarPU, Jack Poulson from Google Inc. for his help in tuning Elemental and the vendors Cray/IBM/Intel/NVIDIA for their hardware donations and/or systems’ remote accesses in the context of the Cray Center of Excellence, the Intel Parallel Computing Center and the NVIDIA GPU Research Center awarded to the Extreme Computing Research Center at KAUST. | en
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) | en
dc.relation.url | http://ieeexplore.ieee.org/document/8053812/ | en
dc.rights | (c) 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works. | en
dc.subject | Asynchronous execution | en
dc.subject | Complexity theory | en
dc.subject | Computer architecture | en
dc.subject | Directed acyclic graph | en
dc.subject | Dynamic runtime system | en
dc.subject | Fine-grained execution | en
dc.subject | Hardware | en
dc.subject | Heuristic algorithms | en
dc.subject | High performance computing | en
dc.subject | Matrix decomposition | en
dc.subject | Polar decomposition | en
dc.subject | Software algorithms | en
dc.title | Asynchronous Task-Based Polar Decomposition on Single Node Manycore Architectures | en
dc.type | Article | en
dc.contributor.department | ECRC, KAUST, Jeddah, Jeddah Saudi Arabia | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.identifier.journal | IEEE Transactions on Parallel and Distributed Systems | en
dc.eprint.version | Post-print | en
dc.contributor.institution | HiePACS, Bordeaux INP, Talence, Aquitaine France | en
kaust.author | Sukkari, Dalal E. | en
kaust.author | Ltaief, Hatem | en
kaust.author | Keyes, David E. | en