A framework for dense triangular matrix kernels on various manycore architectures
Name:
Charara_et_al-2017-Concurrency_and_Computation__Practice_and_Experience.pdf
Size:
3.113Mb
Format:
PDF
Description:
Final published version
Type
Article
Authors
Charara, Ali
Keyes, David E.
Ltaief, Hatem
KAUST Department
Applied Mathematics and Computational Science Program
Computer Science Program
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Extreme Computing Research Center
Date
2017-06-05
Online Publication Date
2017-06-05
Print Publication Date
2017-08-10
Permanent link to this record
http://hdl.handle.net/10754/622077
Abstract
We present a new high-performance framework for dense triangular Basic Linear Algebra Subroutines (BLAS) kernels, i.e., triangular matrix-matrix multiplication (TRMM) and triangular solve (TRSM), on various manycore architectures. This is an extension of a previous work on a single GPU by the same authors, presented at the EuroPar'16 conference, in which we demonstrated the effectiveness of recursive formulations in enhancing the performance of these kernels. In this paper, the performance of triangular BLAS kernels on a single GPU is further enhanced by implementing customized in-place CUDA kernels for TRMM and TRSM, which are called at the bottom of the recursion. In addition, a multi-GPU implementation of TRMM and TRSM is proposed, and we show an almost linear performance scaling as the number of GPUs increases. Finally, the algorithmic recursive formulation of these triangular BLAS kernels is in fact oblivious to the targeted hardware architecture. We, therefore, port these recursive kernels to homogeneous x86 hardware architectures by relying on the vendor-optimized BLAS implementations. Results reported on various hardware architectures highlight a significant performance improvement against state-of-the-art implementations. These new kernels are freely available in the KAUST BLAS (KBLAS) open-source library at https://github.com/ecrc/kblas.
Citation
Charara A, Keyes D, Ltaief H (2017) A framework for dense triangular matrix kernels on various manycore architectures. Concurrency and Computation: Practice and Experience 29: e4187. Available: http://dx.doi.org/10.1002/cpe.4187.
Sponsors
We would like to thank NVIDIA for hardware donations in the context of a GPU Research Center, Intel for support in the form of a Parallel Computing Center award to the Extreme Computing Research Center at King Abdullah University of Science and Technology, and KAUST IT Research Computing for their hardware support on the GPU-based system.
Publisher
Wiley
DOI
10.1002/cpe.4187
Additional Links
http://onlinelibrary.wiley.com/doi/10.1002/cpe.4187/full
Relations
Is Supplemented By:
- [Software] Title: ecrc/kblas. Publication Date: 2016-12-28. GitHub: ecrc/kblas. Handle: 10754/667017
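As an illustration of the recursive formulation mentioned in the abstract, the following is a minimal pure-Python sketch of a recursive triangular solve (lower triangular, left side, in place). It is not the KBLAS implementation: the function name `trsm_lower_left`, its parameters, and the `base` cutoff are hypothetical, and the plain forward substitution in the base case only stands in for the customized kernels that the abstract says are called at the bottom of the recursion.

```python
def trsm_lower_left(L, B, lo=0, hi=None, base=2):
    """Solve L[lo:hi, lo:hi] @ X = B[lo:hi, :] in place, overwriting B with X.

    L is a square lower-triangular matrix (list of rows) with a nonzero
    diagonal; B is a list of rows with any number of columns.
    All names here are illustrative assumptions, not a library API.
    """
    if hi is None:
        hi = len(L)
    n = hi - lo
    ncols = len(B[0]) if B else 0
    if n <= base:
        # Base case: forward substitution on a small diagonal block.
        for i in range(lo, hi):
            for j in range(ncols):
                s = B[i][j]
                for k in range(lo, i):
                    s -= L[i][k] * B[k][j]
                B[i][j] = s / L[i][i]
        return B
    mid = lo + n // 2
    # 1) Solve with the top-left triangle: L11 @ X1 = B1.
    trsm_lower_left(L, B, lo, mid, base)
    # 2) Update the bottom rows with a general product: B2 -= L21 @ X1.
    for i in range(mid, hi):
        for j in range(ncols):
            s = 0.0
            for k in range(lo, mid):
                s += L[i][k] * B[k][j]
            B[i][j] -= s
    # 3) Solve with the bottom-right triangle: L22 @ X2 = B2.
    trsm_lower_left(L, B, mid, hi, base)
    return B
```

In this sketch, step 2 is an ordinary matrix-matrix product. In an optimized setting that step would presumably be handed to a tuned GEMM routine, so that only the small diagonal blocks need a dedicated triangular kernel.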
Except where otherwise noted, this item's license is described as: This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.