Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs

Handle URI:
http://hdl.handle.net/10754/621728
Title:
Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs
Authors:
Abdelfattah, Ahmad; Ltaief, Hatem (ORCID: 0000-0002-6897-1095); Keyes, David E. (ORCID: 0000-0002-4052-7224); Dongarra, Jack
Abstract:
Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a preconditioned Krylov method. Along with application of the preconditioner, memory-bound Sparse Matrix-Vector Multiplication (SpMV) is the most time-consuming operation in such solvers. Multi-species models produce Jacobians with a dense block structure, where the block size can be as large as a few dozen. Failing to exploit this dense block structure vastly underutilizes hardware capable of delivering high performance on dense BLAS operations. This paper presents a GPU-accelerated SpMV kernel for block-sparse matrices. Dense matrix-vector multiplications within the sparse-block structure leverage optimization techniques from the KBLAS library, a high performance library for dense BLAS kernels. The design ideas of KBLAS can be applied to block-sparse matrices. Furthermore, a technique is proposed to balance the workload among thread blocks when there are large variations in the lengths of nonzero rows. Multi-GPU performance is highlighted. The proposed SpMV kernel outperforms existing state-of-the-art implementations using matrices with real structures from different applications. Copyright © 2016 John Wiley & Sons, Ltd.
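For readers unfamiliar with the block-sparse layout the abstract describes, the following is a minimal CPU sketch of SpMV over a BSR-like (block compressed sparse row) structure. This is an illustration only, not the authors' GPU kernel: the array names, the helper function, and the tiny example matrix are all assumptions made for the sketch. The point it demonstrates is that each stored nonzero is a dense b-by-b tile, so the inner work is a dense matrix-vector product, which is exactly the kind of operation dense BLAS optimizations (as in KBLAS) accelerate well.

```python
import numpy as np

def bsr_spmv(block_ptr, block_col, blocks, x, b):
    """Compute y = A @ x for a block-sparse matrix in BSR-like arrays.

    block_ptr[i]..block_ptr[i+1] indexes the nonzero blocks of block-row i,
    block_col[k] is the block-column of stored block k, and blocks[k] is a
    dense b-by-b tile. The dense tile GEMV in the inner loop is what makes
    this format friendlier to GPUs than scalar CSR.
    """
    nrows = (len(block_ptr) - 1) * b
    y = np.zeros(nrows)
    for i in range(len(block_ptr) - 1):            # loop over block rows
        for k in range(block_ptr[i], block_ptr[i + 1]):
            j = block_col[k]                       # block column index
            y[i * b:(i + 1) * b] += blocks[k] @ x[j * b:(j + 1) * b]
    return y

# Tiny example with block size b = 2, block structure:
#   A = [[B0, B1],
#        [ 0, B2]]
b = 2
blocks = np.array([np.eye(b), 2 * np.eye(b), 3 * np.eye(b)])
block_ptr = [0, 2, 3]          # block-row 0 owns blocks 0..1, row 1 owns block 2
block_col = [0, 1, 1]
x = np.arange(4, dtype=float)  # [0, 1, 2, 3]
y = bsr_spmv(block_ptr, block_col, blocks, x, b)
print(y)  # [4. 7. 6. 9.]
```

The load-balancing issue the abstract raises shows up here in the outer loop: if one block row holds many more tiles than another, a naive one-thread-block-per-row GPU mapping leaves most thread blocks idle while a few do all the work.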
KAUST Department:
Extreme Computing Research Center
Citation:
Abdelfattah A, Ltaief H, Keyes D, Dongarra J (2016) Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs. Concurrency Computat: Pract Exper 28: 3447–3465. Available: http://dx.doi.org/10.1002/cpe.3874.
Publisher:
Wiley-Blackwell
Journal:
Concurrency and Computation: Practice and Experience
Issue Date:
23-May-2016
DOI:
10.1002/cpe.3874
Type:
Article
ISSN:
1532-0626
Sponsors:
This work is partly supported by Saudi Aramco through research project RGC/3/1438. The authors would also like to thank NVIDIA for their support and generous hardware donations, as well as Pascal Henon from TOTAL S.A. for fruitful technical discussions.
Appears in Collections:
Articles; Extreme Computing Research Center

Full metadata record

dc.date.accessioned: 2016-11-03T13:23:41Z
dc.date.available: 2016-11-03T13:23:41Z
dc.subject: Block sparse matrices
dc.subject: GPU optimizations
dc.subject: Sparse matrix-vector multiplication
dc.contributor.institution: Innovative Computing Laboratory, University of Tennessee, Knoxville, USA
dc.contributor.institution: Oak Ridge National Laboratory, USA
dc.contributor.institution: University of Manchester, UK
kaust.author: Ltaief, Hatem
kaust.author: Keyes, David E.
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.