Type
Conference PaperAuthors
Charara, Ali
Ltaief, Hatem

Keyes, David E.

KAUST Department
Extreme Computing Research CenterDate
2016-08-09Permanent link to this record
http://hdl.handle.net/10754/621824
Metadata
Show full item recordAbstract
A new implementation of the triangular matrix-matrix multiplication (TRMM) and the triangular solve (TRSM) kernels are described on GPU hardware accelerators. Although part of the Level 3 BLAS family, these highly computationally intensive kernels fail to achieve the percentage of the theoretical peak performance on GPUs that one would expect when running kernels with similar surface-to-volume ratio on hardware accelerators, i.e., the standard matrix-matrix multiplication (GEMM). The authors propose adopting a recursive formulation, which enriches the TRMM and TRSM inner structures with GEMM calls and, therefore, reduces memory traffic while increasing the level of concurrency. The new implementation enables efficient use of the GPU memory hierarchy and mitigates the latency overhead, to run at the speed of the higher cache levels. Performance comparisons show up to eightfold and twofold speedups for large dense matrix sizes, against the existing state-of-the-art TRMM and TRSM implementations from NVIDIA cuBLAS, respectively, across various GPU generations. Once integrated into high-level Cholesky-based dense linear algebra algorithms, the performance impact on the overall applications demonstrates up to fourfold and twofold speedups, against the equivalent native implementations, linked with cuBLAS TRMM and TRSM kernels, respectively. The new TRMM/TRSM kernel implementations are part of the open-source KBLAS software library (http://ecrc.kaust.edu.sa/Pages/Res-kblas.aspx) and are lined up for integration into the NVIDIA cuBLAS library in the upcoming v8.0 release.Sponsors
We thank NVIDIA for hardware donations in the context of the GPU Research Center Award to the Extreme Computing Research Center at the King Abdullah University of Science and Technology and KAUST IT Research Computing for hardware support on the GPU-based system.Publisher
Springer NatureConference/Event name
Euro-Par 2016: European Conference on Parallel ProcessingISBN
978-3-319-43659-3ISSN
0302-9743Additional Links
http://link.springer.com/chapter/10.1007%2F978-3-319-43659-3_35ae974a485f413a2113503eed53cd6c53
10.1007/978-3-319-43659-3_35