Systematic approach in optimizing numerical memory-bound kernels on GPU

Handle URI:
http://hdl.handle.net/10754/564656
Title:
Systematic approach in optimizing numerical memory-bound kernels on GPU
Authors:
Abdelfattah, Ahmad M.; Keyes, David E. (0000-0002-4052-7224); Ltaief, Hatem (0000-0002-6897-1095)
Abstract:
The use of GPUs has been very beneficial in accelerating dense linear algebra (DLA) computational kernels. Many high-performance numerical libraries, such as CUBLAS, MAGMA, and CULA, provide BLAS and LAPACK implementations on GPUs as well as hybrid computations involving both CPUs and GPUs. GPUs usually achieve better performance than CPUs for compute-bound operations, especially those characterized by a regular data access pattern. This paper highlights a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs by taking advantage of the underlying device's architecture (e.g., high throughput). This methodology was shown to outperform existing state-of-the-art GPU implementations of the symmetric matrix-vector multiplication (SYMV), characterized by an irregular data access pattern, in a recent work (Abdelfattah et al., VECPAR 2012). We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, making it very influential in calculating the Hessenberg and bidiagonal reductions of general matrices (radar applications), which are the first step toward computing eigenvalues and singular values, respectively. For small and medium-size matrices (≤4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% improvement in double precision (DP) over existing open-source and commercial software solutions. These results improve reduction algorithms for both small and large matrices. The improved GEMV performance engenders an average 30% (SP) and 15% (DP) improvement in the Hessenberg reduction and up to 25% (SP) and 14% (DP) improvement in the bidiagonal reduction over the implementation provided by CUBLAS 5.0. © 2013 Springer-Verlag.
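For context on why GEMV is memory-bound: computing y = αAx + βy for an m×n matrix performs about 2mn floating-point operations while moving roughly mn + n + 2m words of data, i.e., only about two flops per matrix element loaded, so the kernel's speed is bounded by memory bandwidth rather than arithmetic throughput. The CUDA sketch below shows a minimal one-thread-per-row GEMV for a column-major matrix, arranged so that global loads of A coalesce across a warp; it is an illustrative baseline under these assumptions, not the optimized kernel proposed in the paper.

#include <cuda_runtime.h>

// Illustrative memory-bound GEMV baseline: y = alpha*A*x + beta*y,
// where A is m x n, column-major, with leading dimension lda >= m.
// NOTE: a minimal sketch for exposition, not the paper's kernel.
__global__ void gemv_naive(int m, int n, float alpha,
                           const float* A, int lda,
                           const float* x, float beta, float* y)
{
    // One thread per row of A.
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= m) return;

    float acc = 0.0f;
    for (int col = 0; col < n; ++col) {
        // Threads of a warp read consecutive rows of the same column,
        // so this global load is coalesced.
        acc += A[(size_t)col * lda + row] * x[col];
    }
    y[row] = alpha * acc + beta * y[row];
}

// Hypothetical launch for device arrays dA, dx, dy:
//   int threads = 256, blocks = (m + threads - 1) / threads;
//   gemv_naive<<<blocks, threads>>>(m, n, alpha, dA, lda, dx, beta, dy);

The column-major layout matches the BLAS convention used by libraries such as CUBLAS and MAGMA; each element of A is read exactly once and reused zero times, which is what makes the operation bandwidth-limited.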
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; KAUST Supercomputing Laboratory (KSL); Applied Mathematics and Computational Science Program; Extreme Computing Research Center
Publisher:
Springer Science + Business Media
Journal:
Euro-Par 2012: Parallel Processing Workshops
Conference/Event name:
Parallel Processing Workshops, Euro-Par 2012: BDMC 2012, CGWS 2012, HeteroPar 2012, HiBB 2012, OMHI 2012, Paraphrase 2012, PROPER 2012, Resilience 2012, UCHPC 2012, VHPC 2012
Issue Date:
2013
DOI:
10.1007/978-3-642-36949-0_23
Type:
Conference Paper
ISSN:
0302-9743
ISBN:
9783642369483
Appears in Collections:
Conference Papers; Applied Mathematics and Computational Science Program; KAUST Supercomputing Laboratory (KSL); Extreme Computing Research Center; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division

Full metadata record

DC Field | Value | Language
dc.contributor.author | Abdelfattah, Ahmad M. | en
dc.contributor.author | Keyes, David E. | en
dc.contributor.author | Ltaief, Hatem | en
dc.date.accessioned | 2015-08-04T07:11:08Z | en
dc.date.available | 2015-08-04T07:11:08Z | en
dc.date.issued | 2013 | en
dc.identifier.isbn | 9783642369483 | en
dc.identifier.issn | 0302-9743 | en
dc.identifier.doi | 10.1007/978-3-642-36949-0_23 | en
dc.identifier.uri | http://hdl.handle.net/10754/564656 | en
dc.description.abstract | The use of GPUs has been very beneficial in accelerating dense linear algebra (DLA) computational kernels. Many high-performance numerical libraries, such as CUBLAS, MAGMA, and CULA, provide BLAS and LAPACK implementations on GPUs as well as hybrid computations involving both CPUs and GPUs. GPUs usually achieve better performance than CPUs for compute-bound operations, especially those characterized by a regular data access pattern. This paper highlights a systematic approach for efficiently implementing memory-bound DLA kernels on GPUs by taking advantage of the underlying device's architecture (e.g., high throughput). This methodology was shown to outperform existing state-of-the-art GPU implementations of the symmetric matrix-vector multiplication (SYMV), characterized by an irregular data access pattern, in a recent work (Abdelfattah et al., VECPAR 2012). We propose to extend this methodology to the general matrix-vector multiplication (GEMV) kernel. The performance results show that our GEMV implementation achieves better performance for relatively small to medium matrix sizes, making it very influential in calculating the Hessenberg and bidiagonal reductions of general matrices (radar applications), which are the first step toward computing eigenvalues and singular values, respectively. For small and medium-size matrices (≤4500), our GEMV kernel achieves an average 60% improvement in single precision (SP) and an average 25% improvement in double precision (DP) over existing open-source and commercial software solutions. These results improve reduction algorithms for both small and large matrices. The improved GEMV performance engenders an average 30% (SP) and 15% (DP) improvement in the Hessenberg reduction and up to 25% (SP) and 14% (DP) improvement in the bidiagonal reduction over the implementation provided by CUBLAS 5.0. © 2013 Springer-Verlag. | en
dc.publisher | Springer Science + Business Media | en
dc.subject | Bidiagonal Reduction | en
dc.subject | GPU Optimizations | en
dc.subject | Hessenberg Reduction | en
dc.subject | Matrix-Vector Multiplication | en
dc.subject | Memory-Bound Operations | en
dc.title | Systematic approach in optimizing numerical memory-bound kernels on GPU | en
dc.type | Conference Paper | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | KAUST Supercomputing Laboratory (KSL) | en
dc.contributor.department | Applied Mathematics and Computational Science Program | en
dc.contributor.department | Extreme Computing Research Center | en
dc.identifier.journal | Euro-Par 2012: Parallel Processing Workshops | en
dc.conference.date | 27 August 2012 through 31 August 2012 | en
dc.conference.name | Parallel Processing Workshops, Euro-Par 2012: BDMC 2012, CGWS 2012, HeteroPar 2012, HiBB 2012, OMHI 2012, Paraphrase 2012, PROPER 2012, Resilience 2012, UCHPC 2012, VHPC 2012 | en
dc.conference.location | Rhodes Island | en
kaust.author | Abdelfattah, Ahmad M. | en
kaust.author | Keyes, David E. | en
kaust.author | Ltaief, Hatem | en