Batched Triangular DLA for Very Small Matrices on GPUs

Handle URI:
http://hdl.handle.net/10754/623339
Title:
Batched Triangular DLA for Very Small Matrices on GPUs
Authors:
Charara, Ali (0000-0002-9509-7794); Keyes, David E. (0000-0002-4052-7224); Ltaief, Hatem (0000-0002-6897-1095)
Abstract:
In several scientific applications, such as tensor contractions in deep learning or data compression in hierarchical low-rank matrix approximation, the bulk of the computation typically resides in performing thousands of independent dense linear algebra operations on very small matrices (usually of size less than 100). Batched dense linear algebra kernels are becoming ubiquitous for such scientific computations. Within a single API call, these kernels simultaneously launch a large number of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the utilization of the underlying hardware.
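To make the "single API call" pattern concrete, below is a minimal host-side sketch, not the poster's own triangular kernels, using cuBLAS's cublasDtrsmBatched to launch thousands of independent small triangular solves A_i * X_i = B_i at once. The matrix size (32), batch count (10000), and data values are illustrative assumptions.

// Illustrative sketch only: one batched cuBLAS call replaces thousands of
// individual small triangular-solve launches.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 32;            // very small matrix size, as in the abstract
    const int batch = 10000;     // number of independent problems
    const size_t elems = (size_t)n * n;

    // One contiguous slab for all A_i and all B_i.
    double *dA, *dB;
    cudaMalloc((void**)&dA, batch * elems * sizeof(double));
    cudaMalloc((void**)&dB, batch * elems * sizeof(double));

    // Simple host data: each A_i has 2.0 on its diagonal, each B_i is all ones.
    std::vector<double> hA(batch * elems, 0.0), hB(batch * elems, 1.0);
    for (int b = 0; b < batch; ++b)
        for (int i = 0; i < n; ++i)
            hA[(size_t)b * elems + (size_t)i * n + i] = 2.0;
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(double), cudaMemcpyHostToDevice);

    // Batched cuBLAS APIs take device arrays of per-matrix pointers.
    std::vector<double*> hAptr(batch), hBptr(batch);
    for (int b = 0; b < batch; ++b) {
        hAptr[b] = dA + (size_t)b * elems;
        hBptr[b] = dB + (size_t)b * elems;
    }
    double **dAptr, **dBptr;
    cudaMalloc((void**)&dAptr, batch * sizeof(double*));
    cudaMalloc((void**)&dBptr, batch * sizeof(double*));
    cudaMemcpy(dAptr, hAptr.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dBptr, hBptr.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    // A single API call performs all `batch` small lower-triangular solves.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0;
    cublasDtrsmBatched(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                       n, n, &alpha,
                       (const double**)dAptr, n, dBptr, n, batch);
    cudaDeviceSynchronize();
    printf("solved %d independent %dx%d triangular systems in one call\n",
           batch, n, n);

    cublasDestroy(handle);
    cudaFree(dAptr); cudaFree(dBptr); cudaFree(dA); cudaFree(dB);
    return 0;
}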
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Extreme Computing Research Center
Conference/Event name:
High Performance Computing Saudi Arabia (HPC Saudi) 2017
Issue Date:
13-Mar-2017
Type:
Poster
Appears in Collections:
Posters; High Performance Computing Saudi Arabia (HPC Saudi) 2017

Full metadata record

DC Field: Value (Language)

dc.contributor.author: Charara, Ali (en)
dc.contributor.author: Keyes, David E. (en)
dc.contributor.author: Ltaief, Hatem (en)
dc.date.accessioned: 2017-05-04T12:33:22Z
dc.date.available: 2017-05-04T12:33:22Z
dc.date.issued: 2017-03-13
dc.identifier.uri: http://hdl.handle.net/10754/623339
dc.description.abstract: In several scientific applications, such as tensor contractions in deep learning or data compression in hierarchical low-rank matrix approximation, the bulk of the computation typically resides in performing thousands of independent dense linear algebra operations on very small matrices (usually of size less than 100). Batched dense linear algebra kernels are becoming ubiquitous for such scientific computations. Within a single API call, these kernels simultaneously launch a large number of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the utilization of the underlying hardware. (en)
dc.title: Batched Triangular DLA for Very Small Matrices on GPUs (en)
dc.type: Poster (en)
dc.contributor.department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division (en)
dc.contributor.department: Extreme Computing Research Center (en)
dc.conference.date: March 13-15, 2017 (en)
dc.conference.name: High Performance Computing Saudi Arabia (HPC Saudi) 2017 (en)
dc.conference.location: KAUST (en)
kaust.author: Charara, Ali (en)
kaust.author: Keyes, David E. (en)
kaust.author: Ltaief, Hatem (en)
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.