Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs

Handle URI:
http://hdl.handle.net/10754/622975
Title:
Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs
Authors:
Charara, Ali (0000-0002-9509-7794); Keyes, David Elliot (0000-0002-4052-7224); Ltaief, Hatem (0000-0002-6897-1095)
Abstract:
Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes using single and multiple GPUs. By deploying two-sided recursive formulations, stressing register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
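
To make the single-call batching pattern concrete, here is a minimal sketch, not taken from the paper or from KBLAS, that issues a whole batch of very small triangular solves through the vendor routine cublasDtrsmBatched. The matrix size n = 16, the batch count, and all buffer names are illustrative assumptions.

/* Sketch: one API call launches a batch of small triangular solves
 * (L_i * X_i = alpha * B_i). The paper's own kernels live in KBLAS;
 * cublasDtrsmBatched stands in here as a generic vendor batched routine.
 * Matrix size, batch count, and names are assumptions, not the paper's. */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdlib.h>

int main(void)
{
    const int n = 16;        /* "very small" matrix size */
    const int batch = 1000;  /* thousands of independent problems */
    const double alpha = 1.0;

    /* One contiguous slab per operand; the batched API takes an
     * array of per-matrix device pointers. */
    double *A, *B;
    cudaMalloc((void **)&A, sizeof(double) * n * n * batch);
    cudaMalloc((void **)&B, sizeof(double) * n * n * batch);
    /* (initialization of the A_i and B_i omitted for brevity) */

    double **hA = (double **)malloc(batch * sizeof(double *));
    double **hB = (double **)malloc(batch * sizeof(double *));
    for (int i = 0; i < batch; ++i) {
        hA[i] = A + (size_t)i * n * n;
        hB[i] = B + (size_t)i * n * n;
    }
    double **dA, **dB;
    cudaMalloc((void **)&dA, batch * sizeof(double *));
    cudaMalloc((void **)&dB, batch * sizeof(double *));
    cudaMemcpy(dA, hA, batch * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, batch * sizeof(double *), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* A single launch replaces a loop of `batch` cublasDtrsm calls,
     * removing per-call overhead and raising GPU occupancy. */
    cublasDtrsmBatched(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, n, n, &alpha,
                       (const double *const *)dA, n, dB, n, batch);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(A); cudaFree(B);
    free(hA); free(hB);
    return 0;
}

The array-of-pointers layout is what the cuBLAS batched interface expects; the paper's contribution is supplying the triangular operations that vendors cover only partially at these very small sizes.
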
KAUST Department:
Extreme Computing Research Center
Citation:
Ali Charara, David Keyes, and Hatem Ltaief. 2017. Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs. ACM Trans. Math. Softw. 9, 4, Article 39 (March 2017), 26 pages.
Publisher:
Association for Computing Machinery
Journal:
ACM Transactions on Mathematical Software
Issue Date:
6-Mar-2017
Type:
Article
Sponsors:
The authors would like to thank NVIDIA for their hardware donations and remote access to their systems in the context of the NVIDIA GPU Research Center awarded to the Extreme Computing Research Center at KAUST.
Appears in Collections:
Articles

Full metadata record

DC Field | Value | Language
dc.contributor.author | Charara, Ali | en
dc.contributor.author | Keyes, David Elliot | en
dc.contributor.author | Ltaief, Hatem | en
dc.date.accessioned | 2017-03-07T13:38:03Z | -
dc.date.available | 2017-03-07T13:38:03Z | -
dc.date.issued | 2017-03-06 | -
dc.identifier.citation | Ali Charara, David Keyes, and Hatem Ltaief. 2017. Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs. ACM Trans. Math. Softw. 9, 4, Article 39 (March 2017), 26 pages. | en
dc.identifier.uri | http://hdl.handle.net/10754/622975 | -
dc.description.abstract | Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes using single and multiple GPUs. By deploying two-sided recursive formulations, stressing register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations. | en
dc.description.sponsorship | The authors would like to thank NVIDIA for their hardware donations and remote access to their systems in the context of the NVIDIA GPU Research Center awarded to the Extreme Computing Research Center at KAUST. | en
dc.language.iso | en | en
dc.publisher | Association for Computing Machinery | en
dc.rights | © ACM, 2017. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, 9, 4 (2017). | en
dc.subject | Batched BLAS Kernels | en
dc.subject | Dense Linear Algebra | en
dc.subject | KBLAS | en
dc.subject | Hardware Accelerators | en
dc.subject | Recursive formulation | en
dc.title | Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs | en
dc.type | Article | en
dc.contributor.department | Extreme Computing Research Center | en
dc.identifier.journal | ACM Transactions on Mathematical Software | en
dc.eprint.version | Post-print | en
dc.contributor.affiliation | King Abdullah University of Science and Technology (KAUST) | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.