Asynchronous Task-Based Polar Decomposition on Manycore Architectures

Handle URI:
http://hdl.handle.net/10754/621202
Title:
Asynchronous Task-Based Polar Decomposition on Manycore Architectures
Authors:
Sukkari, Dalal; Ltaief, Hatem (0000-0002-6897-1095); Faverge, Mathieu; Keyes, David E. (0000-0002-4052-7224)
Abstract:
This paper introduces the first asynchronous, task-based implementation of the polar decomposition on manycore architectures. Based on a new formulation of the iterative QR-based dynamically weighted Halley algorithm (QDWH) for computing the polar decomposition, the proposed implementation replaces the original, hostile LU factorization in the condition number estimator with the more suitable QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the task-based implementation also takes advantage of the identity structure of the matrix involved in the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been significantly weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, track the data dependencies, and asynchronously schedule the computational tasks on the underlying hardware resources, resulting in out-of-order task scheduling. Benchmarking experiments show significant improvements over existing state-of-the-art high-performance implementations (i.e., Intel MKL and Elemental) of the polar decomposition on the latest shared-memory systems from several vendors (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs, and IBM Power8), while maintaining high numerical accuracy.
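For readers unfamiliar with QDWH, the following is a minimal dense, sequential NumPy sketch of the QR-based iteration the abstract refers to. It is not the report's implementation: there is no tiling, no task graph, and no StarPU. The function name `qdwh_polar` and its parameters are hypothetical, and the lower bound on the smallest singular value is taken from an SVD purely for brevity, whereas the report uses a QR-based condition number estimator. The sketch assumes a real, full-column-rank matrix with at least as many rows as columns. The stacked matrix passed to the QR factorization has an identity block at the bottom, which is the structure the abstract says the task-based implementation exploits.

```python
import numpy as np

def qdwh_polar(A, tol=1e-12, max_iter=20):
    """Sketch of the polar decomposition A = U @ H via QR-based QDWH.

    Assumes A is real, m x n with m >= n, and has full column rank.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    alpha = np.linalg.norm(A, 2)            # spectral norm, scales X0 to norm 1
    X = A / alpha
    # Assumption: lower bound on sigma_min(X0) taken from an SVD for brevity.
    l = 0.9 * np.linalg.svd(X, compute_uv=False)[-1]
    I = np.eye(n)
    for _ in range(max_iter):
        # Dynamically weighted Halley parameters a, b, c from the bound l.
        l2 = l * l
        d = np.cbrt(4.0 * (1.0 - l2) / (l2 * l2))
        s = np.sqrt(1.0 + d)
        a = s + 0.5 * np.sqrt(8.0 - 4.0 * d + 8.0 * (2.0 - l2) / (l2 * s))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # QR of the stacked matrix [sqrt(c) X; I]; the bottom block is the identity.
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, I]))
        Q1, Q2 = Q[:m], Q[m:]
        X_new = (b / c) * X + ((a - b / c) / np.sqrt(c)) * (Q1 @ Q2.T)
        # Update the lower bound on the smallest singular value of X.
        l = min(1.0, l * (a + b * l2) / (1.0 + c * l2))
        delta = np.linalg.norm(X_new - X, "fro")
        X = X_new
        if delta < tol:
            break
    H = X.T @ A
    H = 0.5 * (H + H.T)                     # symmetrize the positive semidefinite factor
    return X, H
```

A typical call would be `U, H = qdwh_polar(A)`, after which `U` should have orthonormal columns and `U @ H` should reproduce `A` up to rounding.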
KAUST Department:
CEMSE
Issue Date:
25-Oct-2016
Type:
Technical Report
Sponsors:
Cray, Intel, NVIDIA
Appears in Collections:
Technical Reports

Full metadata record

DC Field | Value | Language
dc.contributor.author | Sukkari, Dalal | en
dc.contributor.author | Ltaief, Hatem | en
dc.contributor.author | Faverge, Mathieu | en
dc.contributor.author | Keyes, David E. | en
dc.date.accessioned | 2016-10-25T05:28:33Z | -
dc.date.available | 2016-10-25T05:28:33Z | -
dc.date.issued | 2016-10-25 | -
dc.identifier.uri | http://hdl.handle.net/10754/621202 | -
dc.description.abstract | (identical to the Abstract field above) | en
dc.description.sponsorship | Cray, Intel, NVIDIA | en
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | *
dc.subject | Polar decomposition | en
dc.subject | Asynchronous execution | en
dc.subject | Dynamic runtime system | en
dc.subject | Fine-grained execution | en
dc.subject | Directed acyclic graph | en
dc.subject | High performance computing | en
dc.title | Asynchronous Task-Based Polar Decomposition on Manycore Architectures | en
dc.type | Technical Report | en
dc.contributor.department | CEMSE | en
dc.contributor.institution | Bordeaux INP, CNRS, INRIA et Universite de Bordeaux | en
This item is licensed under a Creative Commons License
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.