A framework for dense triangular matrix kernels on various manycore architectures

Handle URI:
http://hdl.handle.net/10754/622077
Title:
A framework for dense triangular matrix kernels on various manycore architectures
Authors:
Charara, Ali (0000-0002-9509-7794); Keyes, David E. (0000-0002-4052-7224); Ltaief, Hatem (0000-0002-6897-1095)
Abstract:
We present a new high-performance framework for dense triangular Basic Linear Algebra Subroutines (BLAS) kernels, i.e., triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This work extends our previous single-GPU implementation, presented at the Euro-Par'16 conference, in which we demonstrated the effectiveness of recursive formulations in enhancing the performance of these kernels. In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized in-place CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion. In addition, a multi-GPU implementation of TRMM and TRSM is proposed, and we show almost linear performance scaling as the number of GPUs increases. Finally, the algorithmic recursive formulation of these triangular BLAS kernels is in fact oblivious to the targeted hardware architecture. We therefore port these recursive kernels to homogeneous x86 hardware architectures by relying on vendor-optimized BLAS implementations. Results reported on various hardware architectures highlight a significant performance improvement over state-of-the-art implementations. These new kernels are freely available in the KAUST BLAS (KBLAS) open-source library at https://github.com/ecrc/kblas.
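The recursive formulation mentioned in the abstract can be made concrete with a short sketch. The following is a minimal illustration of a recursive TRSM for the left/lower/non-transpose double-precision case, assuming a linked CBLAS implementation and column-major storage. The helper name rec_dtrsm and the crossover size REC_TRSM_STOP are hypothetical, introduced only for illustration; the actual KBLAS kernels instead call customized in-place CUDA kernels at the bottom of the recursion and cover all parameter combinations.

/* Sketch only: solve A * X = alpha * B in place (X overwrites B), where
 * A is m-by-m lower triangular and B is m-by-n, both column-major.
 * Assumptions (not from the paper): CBLAS is available; rec_dtrsm and
 * REC_TRSM_STOP are hypothetical names chosen for this example. */
#include <cblas.h>

#define REC_TRSM_STOP 128  /* below this size, fall back to the vendor TRSM */

static void rec_dtrsm(int m, int n, double alpha,
                      const double *A, int lda, double *B, int ldb)
{
    if (m <= REC_TRSM_STOP) {
        /* Base case: vendor-optimized TRSM on the small triangular block. */
        cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower,
                    CblasNoTrans, CblasNonUnit,
                    m, n, alpha, A, lda, B, ldb);
        return;
    }
    int m1 = m / 2, m2 = m - m1;

    /* Partition A = [A11 0; A21 A22] and B = [B1; B2] by rows at m1. */

    /* X1 = A11^{-1} (alpha * B1), computed recursively in place. */
    rec_dtrsm(m1, n, alpha, A, lda, B, ldb);

    /* B2 := alpha * B2 - A21 * X1, a large GEMM update. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m2, n, m1, -1.0, A + m1, lda, B, ldb,
                alpha, B + m1, ldb);

    /* X2 = A22^{-1} * B2, computed recursively (alpha already applied). */
    rec_dtrsm(m2, n, 1.0, A + (size_t)m1 * lda + m1, lda, B + m1, ldb);
}

The design point of the recursion is that most of the floating-point work is funneled into large GEMM calls, which typically run much closer to hardware peak than a monolithic triangular solve.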
KAUST Department:
Extreme Computing Research Center
Citation:
Charara A, Keyes D, Ltaief H (2017) A framework for dense triangular matrix kernels on various manycore architectures. Concurrency and Computation: Practice and Experience 29: e4187. Available: http://dx.doi.org/10.1002/cpe.4187.
Publisher:
Wiley-Blackwell
Journal:
Concurrency and Computation: Practice and Experience
Issue Date:
6-Jun-2017
DOI:
10.1002/cpe.4187
Type:
Article
ISSN:
1532-0626
Sponsors:
We would like to thank NVIDIA for hardware donations in the context of a GPU Research Center, Intel for support in the form of a Parallel Computing Center award to the Extreme Computing Research Center at King Abdullah University of Science and Technology, and KAUST IT Research Computing for their hardware support on the GPU-based system.
Additional Links:
http://onlinelibrary.wiley.com/doi/10.1002/cpe.4187/full
Appears in Collections:
Articles; Extreme Computing Research Center

Full metadata record

DC Field | Value | Language
dc.contributor.author | Charara, Ali | en
dc.contributor.author | Keyes, David E. | en
dc.contributor.author | Ltaief, Hatem | en
dc.date.accessioned | 2017-10-02T12:59:30Z | -
dc.date.available | 2016-12-28T06:27:48Z | -
dc.date.available | 2017-10-02T12:59:30Z | -
dc.date.issued | 2017-06-06 | en
dc.identifier.citation | Charara A, Keyes D, Ltaief H (2017) A framework for dense triangular matrix kernels on various manycore architectures. Concurrency and Computation: Practice and Experience 29: e4187. Available: http://dx.doi.org/10.1002/cpe.4187. | en
dc.identifier.issn | 1532-0626 | en
dc.identifier.doi | 10.1002/cpe.4187 | en
dc.identifier.uri | http://hdl.handle.net/10754/622077 | -
dc.description.abstract | We present a new high-performance framework for dense triangular Basic Linear Algebra Subroutines (BLAS) kernels, i.e., triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This work extends our previous single-GPU implementation, presented at the Euro-Par'16 conference, in which we demonstrated the effectiveness of recursive formulations in enhancing the performance of these kernels. In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized in-place CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion. In addition, a multi-GPU implementation of TRMM and TRSM is proposed, and we show almost linear performance scaling as the number of GPUs increases. Finally, the algorithmic recursive formulation of these triangular BLAS kernels is in fact oblivious to the targeted hardware architecture. We therefore port these recursive kernels to homogeneous x86 hardware architectures by relying on vendor-optimized BLAS implementations. Results reported on various hardware architectures highlight a significant performance improvement over state-of-the-art implementations. These new kernels are freely available in the KAUST BLAS (KBLAS) open-source library at https://github.com/ecrc/kblas. | en
dc.description.sponsorship | We would like to thank NVIDIA for hardware donations in the context of a GPU Research Center, Intel for support in the form of a Parallel Computing Center award to the Extreme Computing Research Center at King Abdullah University of Science and Technology, and KAUST IT Research Computing for their hardware support on the GPU-based system. | en
dc.language.iso | en | en
dc.publisher | Wiley-Blackwell | en
dc.relation.url | http://onlinelibrary.wiley.com/doi/10.1002/cpe.4187/full | en
dc.rights | This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. | en
dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | en
dc.subject | Dense triangular matrix computations | en
dc.subject | KBLAS | en
dc.subject | Manycore optimizations | en
dc.subject | Recursive formulation | en
dc.title | A framework for dense triangular matrix kernels on various manycore architectures | en
dc.type | Article | en
dc.contributor.department | Extreme Computing Research Center | en
dc.identifier.journal | Concurrency and Computation: Practice and Experience | en
dc.eprint.version | Publisher's Version/PDF | en
dc.contributor.affiliation | King Abdullah University of Science and Technology (KAUST) | en
kaust.author | Charara, Ali | en
kaust.author | Keyes, David E. | en
kaust.author | Ltaief, Hatem | en

Version History

Version | Item | Editor | Date | Summary
2 | 10754/622077 | grenzdm | 2017-10-02 12:57:17.602 | Final version published with DOI.
1 | 10754/622077.1 | ltaiefh | 2016-12-28 06:27:48.0 |