Show simple item record

dc.contributor.author: Charara, Ali
dc.contributor.author: Keyes, David E.
dc.contributor.author: Ltaief, Hatem
dc.date.accessioned: 2019-08-22T11:16:57Z
dc.date.available: 2017-03-07T13:38:03Z
dc.date.available: 2019-08-22T11:16:57Z
dc.date.issued: 2019-04-01
dc.identifier.citation: Charara, A., Keyes, D., & Ltaief, H. (2019). Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs. ACM Transactions on Mathematical Software, 45(2), 1–28. doi:10.1145/3267101
dc.identifier.doi: 10.1145/3267101
dc.identifier.uri: http://hdl.handle.net/10754/622975
dc.description.abstract: Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
dc.description.sponsorship: The authors thank NVIDIA for their hardware donations and remote access to their systems in the context of the NVIDIA GPU Research Center awarded to the Extreme Computing Research Center at KAUST.
dc.language.iso: en
dc.publisher: Association for Computing Machinery
dc.relation.url: http://dl.acm.org/citation.cfm?doid=3326465.3267101
dc.rights: © ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Transactions on Mathematical Software, 45(2) (2019-04-01), http://doi.acm.org/10.1145/3267101
dc.subject: Batched BLAS Kernels
dc.subject: Dense Linear Algebra
dc.subject: KBLAS
dc.subject: Hardware Accelerators
dc.subject: Recursive formulation
dc.title: Batched triangular dense linear algebra kernels for very small matrix sizes on GPUs
dc.type: Article
dc.contributor.department: Applied Mathematics and Computational Science Program
dc.contributor.department: Extreme Computing Research Center
dc.contributor.department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
dc.identifier.journal: ACM Transactions on Mathematical Software
dc.eprint.version: Post-print
dc.contributor.institution: 8716 Barbee Lane, Knoxville, TN, 37923, USA
dc.contributor.affiliation: King Abdullah University of Science and Technology (KAUST)
pubs.publication-status: Published
kaust.person: Keyes, David E.
kaust.person: Ltaief, Hatem
refterms.dateFOA: 2018-06-13T18:11:51Z
kaust.acknowledged.supportUnit: Extreme Computing Research Center
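
Note on the batched calling model: the abstract describes kernels that launch up to thousands of similar small matrix computations from a single API call. As a minimal illustration of that interface style, the sketch below issues one batched triangular solve (TRSM) over 1,000 tiny systems through the vendor routine cublasDtrsmBatched, an existing implementation of the operation class the paper's KBLAS kernels target. The matrix size, batch count, and identity/ones initialization are illustrative assumptions, not the paper's experimental setup.

/* sketch_batched_trsm.c: minimal sketch of the batched calling model.
   Assumes CUDA + cuBLAS; compile with: nvcc sketch_batched_trsm.c -lcublas */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define N     32    /* "very small" matrix size, under the paper's 256 cap */
#define BATCH 1000  /* independent triangular solves issued by one API call */

int main(void) {
    /* Host-side operands: each A_i is the identity (trivially triangular)
       and each B_i is all ones, so the solves are well defined. */
    size_t elems = (size_t)N * N * BATCH;
    double *hA = (double *)malloc(elems * sizeof(double));
    double *hB = (double *)malloc(elems * sizeof(double));
    for (size_t i = 0; i < elems; ++i) {
        hA[i] = ((i % (N * N)) % (N + 1) == 0) ? 1.0 : 0.0;  /* diagonal */
        hB[i] = 1.0;
    }

    /* Device slabs plus the per-matrix pointer arrays that the batched
       cuBLAS interface expects. */
    double *dA, *dB;
    cudaMalloc((void **)&dA, elems * sizeof(double));
    cudaMalloc((void **)&dB, elems * sizeof(double));
    cudaMemcpy(dA, hA, elems * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, elems * sizeof(double), cudaMemcpyHostToDevice);

    double **hAp = (double **)malloc(BATCH * sizeof(double *));
    double **hBp = (double **)malloc(BATCH * sizeof(double *));
    for (int i = 0; i < BATCH; ++i) {
        hAp[i] = dA + (size_t)i * N * N;
        hBp[i] = dB + (size_t)i * N * N;
    }
    double **dAp, **dBp;
    cudaMalloc((void **)&dAp, BATCH * sizeof(double *));
    cudaMalloc((void **)&dBp, BATCH * sizeof(double *));
    cudaMemcpy(dAp, hAp, BATCH * sizeof(double *), cudaMemcpyHostToDevice);
    cudaMemcpy(dBp, hBp, BATCH * sizeof(double *), cudaMemcpyHostToDevice);

    /* One call solves A_i * X_i = alpha * B_i for all BATCH systems at once,
       amortizing launch overhead and raising occupancy, as the abstract
       describes. */
    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0;
    cublasStatus_t st = cublasDtrsmBatched(
        handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
        CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, N, N, &alpha,
        (const double *const *)dAp, N, (double *const *)dBp, N, BATCH);
    cudaDeviceSynchronize();
    printf("batched TRSM status: %d\n", (int)st);

    cublasDestroy(handle);
    cudaFree(dAp); cudaFree(dBp); cudaFree(dA); cudaFree(dB);
    free(hAp); free(hBp); free(hA); free(hB);
    return 0;
}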


Files in this item

Name: batchla-1file.pdf
Size: 2.044 MB
Format: PDF
Description: Accepted manuscript
