Simple item record

dc.contributor.author: Charara, Ali
dc.contributor.author: Ltaief, Hatem
dc.contributor.author: Keyes, David E.
dc.date.accessioned: 2016-11-17T07:36:00Z
dc.date.available: 2016-11-17T07:36:00Z
dc.date.issued: 2016-08-09
dc.identifier.isbn: 978-3-319-43659-3
dc.identifier.issn: 0302-9743
dc.identifier.doi: 10.1007/978-3-319-43659-3_35
dc.identifier.uri: http://hdl.handle.net/10754/621824
dc.description.abstract: New implementations of the triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM) kernels on GPU hardware accelerators are described. Although part of the Level 3 BLAS family, these highly computationally intensive kernels fail to achieve the percentage of the theoretical peak performance on GPUs that one would expect when running kernels with a similar surface-to-volume ratio on hardware accelerators, i.e., the standard matrix-matrix multiplication (GEMM). The authors propose adopting a recursive formulation, which enriches the TRMM and TRSM inner structures with GEMM calls and therefore reduces memory traffic while increasing the level of concurrency. The new implementation enables efficient use of the GPU memory hierarchy and mitigates latency overhead, allowing the kernels to run at the speed of the higher cache levels. Performance comparisons show up to eightfold and twofold speedups for large dense matrix sizes against the existing state-of-the-art TRMM and TRSM implementations from NVIDIA cuBLAS, respectively, across various GPU generations. Once integrated into high-level Cholesky-based dense linear algebra algorithms, the performance impact on the overall applications demonstrates up to fourfold and twofold speedups against the equivalent native implementations linked with cuBLAS TRMM and TRSM kernels, respectively. The new TRMM/TRSM kernel implementations are part of the open-source KBLAS software library (http://ecrc.kaust.edu.sa/Pages/Res-kblas.aspx) and are lined up for integration into the NVIDIA cuBLAS library in the upcoming v8.0 release.
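A minimal sketch of the recursive formulation summarized in the abstract, assuming double precision, a left-side lower-triangular non-transposed solve, and column-major storage on the device; the function name rec_trsm and the CUTOFF threshold are illustrative and are not the KBLAS API:

#include <cublas_v2.h>

#define CUTOFF 512  /* illustrative switch-over point, not the KBLAS value */

/* Solve L * X = alpha * B in place (X overwrites B), where L is an m x m
 * lower-triangular matrix and B is m x n, both column-major on the GPU. */
static void rec_trsm(cublasHandle_t h, int m, int n, const double *alpha,
                     const double *L, int ldl, double *B, int ldb)
{
    if (m <= CUTOFF) {
        /* Base case: the triangle is small enough for the vendor kernel. */
        cublasDtrsm(h, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                    CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                    m, n, alpha, L, ldl, B, ldb);
        return;
    }
    int m1 = m / 2, m2 = m - m1;
    const double one = 1.0, minus_one = -1.0;

    /* X1 = inv(L11) * alpha * B1 */
    rec_trsm(h, m1, n, alpha, L, ldl, B, ldb);
    /* B2 = alpha * B2 - L21 * X1: the GEMM carries most of the flops */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m2, n, m1,
                &minus_one, L + m1, ldl, B, ldb, alpha, B + m1, ldb);
    /* X2 = inv(L22) * B2 */
    rec_trsm(h, m2, n, &one, L + m1 + (size_t)m1 * ldl, ldl, B + m1, ldb);
}

TRMM admits the same splitting, with the off-diagonal block again handled by a single GEMM call, which is what lets both kernels approach GEMM-like throughput.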
dc.description.sponsorship: We thank NVIDIA for hardware donations in the context of the GPU Research Center Award to the Extreme Computing Research Center at King Abdullah University of Science and Technology, and KAUST IT Research Computing for hardware support on the GPU-based system.
dc.publisher: Springer Nature
dc.relation.url: http://link.springer.com/chapter/10.1007%2F978-3-319-43659-3_35
dc.rights: Author version of paper archived with thanks to Lecture Notes in Computer Science.
dc.subject: Triangular dense matrix computations
dc.subject: High Performance Computing
dc.subject: Recursive formulation
dc.subject: KBLAS
dc.subject: GPU Optimizations
dc.title: Redesigning Triangular Dense Matrix Computations on GPUs
dc.type: Conference Paper
dc.contributor.department: Applied Mathematics and Computational Science Program
dc.contributor.department: Computer Science Program
dc.contributor.department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
dc.contributor.department: Extreme Computing Research Center
dc.identifier.journal: Euro-Par 2016: Parallel Processing
dc.conference.date: 22-24 Aug 2016
dc.conference.name: Euro-Par 2016: European Conference on Parallel Processing
dc.conference.location: Grenoble, France
dc.eprint.version: Post-print
kaust.person: Charara, Ali
kaust.person: Ltaief, Hatem
kaust.person: Keyes, David E.
refterms.dateFOA: 2018-06-13T20:10:43Z
dc.date.published-online: 2016-08-09
dc.date.published-print: 2016


Files in this item

Name: TriangularDLA_GPU.pdf
Size: 568.9 KB
Format: PDF
Description: Author Version
