Batched Triangular DLA for Very Small Matrices on GPUs

In several scientific applications, such as tensor contractions in deep learning computation or data compression in hierarchical low-rank matrix approximation, the bulk of the computation typically consists of thousands of independent dense linear algebra operations on very small matrices (sizes usually less than 100). Batched dense linear algebra kernels are becoming ubiquitous for such scientific computations. Within a single API call, these kernels can launch a large number of similar matrix computations simultaneously, removing the expensive overhead of multiple API calls while increasing the utilization of the underlying hardware.
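
The idea of a batched interface can be illustrated with a minimal CPU sketch of a batched lower-triangular solve. This is not the GPU implementation the abstract refers to: the function name `batched_trsv_lower` and the list-of-lists data layout are assumptions made only for illustration. The point is that a single call receives the whole batch of small, independent problems, whereas a non-batched interface would need one call per matrix.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def batched_trsv_lower(batch_a: List[Matrix], batch_b: List[Vector]) -> List[Vector]:
    """Solve L_i x_i = b_i for every lower-triangular L_i in the batch.

    Hypothetical helper for illustration only. All problems are passed in
    one call; each one is solved independently by forward substitution.
    """
    if len(batch_a) != len(batch_b):
        raise ValueError("batch_a and batch_b must contain the same number of problems")

    solutions: List[Vector] = []
    # The problems are independent. A GPU kernel would work on many of them
    # concurrently; this sketch simply loops over them on the CPU.
    for a, b in zip(batch_a, batch_b):
        n = len(b)
        x = [0.0] * n
        for i in range(n):
            # Forward substitution: subtract the already-computed terms,
            # then divide by the diagonal entry.
            partial = sum(a[i][j] * x[j] for j in range(i))
            x[i] = (b[i] - partial) / a[i][i]
        solutions.append(x)
    return solutions


# Two independent 2x2 lower-triangular systems solved with one call.
batch_a = [
    [[2.0, 0.0], [1.0, 1.0]],
    [[1.0, 0.0], [3.0, 2.0]],
]
batch_b = [
    [4.0, 5.0],
    [1.0, 7.0],
]
result = batched_trsv_lower(batch_a, batch_b)
```

In this sketch the batch is processed sequentially, so it only models the calling convention. The performance benefit described above comes from executing the independent problems in parallel on the hardware and from paying the call overhead once rather than once per matrix.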

Conference/Event Name
High Performance Computing Saudi Arabia (HPC Saudi) 2017
