Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting

Handle URI:
http://hdl.handle.net/10754/575581
Title:
Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting
Authors:
Dongarra, Jack; Faverge, Mathieu; Ltaief, Hatem (0000-0002-6897-1095); Luszczek, Piotr R.
Abstract:
The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. 
For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. Copyright © 2013 John Wiley & Sons, Ltd.
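The recursive panel factorization named in the abstract can be illustrated with a minimal sequential sketch. This is not the authors' implementation: it has no threads, tiles, or QUARK scheduling, and the names `lu_recursive` and `_factor` are hypothetical. It assumes a dense panel with at least as many rows as columns, stored as a list of rows, and shows only the recursion itself: factor the left half of the columns, update the right half, then recurse on the right half, choosing the largest entry in each column as the pivot.

```python
def _factor(a, piv, lo, hi):
    """Factor columns lo..hi-1 of `a` in place, using rows lo and below."""
    m = len(a)
    if hi - lo == 1:
        # Base case: a single column. Partial pivoting picks the entry
        # of largest magnitude on or below the diagonal.
        p = max(range(lo, m), key=lambda i: abs(a[i][lo]))
        if a[p][lo] == 0.0:
            raise ZeroDivisionError("pivot column is exactly zero")
        piv[lo] = p
        # Swapping whole rows applies the interchange to the columns of L
        # already computed on the left and to the columns still to be
        # factored on the right.
        a[lo], a[p] = a[p], a[lo]
        for i in range(lo + 1, m):
            a[i][lo] /= a[lo][lo]  # multipliers: one column of L
        return

    mid = (lo + hi) // 2
    _factor(a, piv, lo, mid)  # factor the left half of the columns

    # Triangular solve with the unit lower triangular block of L:
    # turns the top rows of the right half into rows of U.
    for j in range(mid, hi):
        for i in range(lo, mid):
            for k in range(lo, i):
                a[i][j] -= a[i][k] * a[k][j]

    # Schur complement update of the remaining rows of the right half.
    for i in range(mid, m):
        for j in range(mid, hi):
            for k in range(lo, mid):
                a[i][j] -= a[i][k] * a[k][j]

    _factor(a, piv, mid, hi)  # factor the updated right half


def lu_recursive(matrix):
    """Return (lu, piv) for a panel with at least as many rows as columns.

    `lu` holds U on and above the diagonal and the multipliers of the
    unit lower triangular L below it. `piv[k]` is the row that was
    swapped with row k at step k. The input is not modified.
    """
    a = [[float(x) for x in row] for row in matrix]
    n = len(a[0])
    if len(a) < n:
        raise ValueError("expected at least as many rows as columns")
    piv = [0] * n
    _factor(a, piv, 0, n)
    return a, piv
```

Calling `lu_recursive([[2, 1, 1], [4, 3, 3], [8, 7, 9]])` returns the packed factors and the pivot rows; applying the recorded row swaps to the input in order and multiplying L by U should reproduce the permuted matrix up to rounding.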
KAUST Department:
KAUST Supercomputing Laboratory (KSL); Extreme Computing Research Center
Publisher:
Wiley-Blackwell
Journal:
Concurrency and Computation: Practice and Experience
Issue Date:
18-Sep-2013
DOI:
10.1002/cpe.3110
Type:
Article
ISSN:
15320626
Sponsors:
Research reported here was partially supported by the National Science Foundation, Department of Energy, and Microsoft Research.
Appears in Collections:
Articles; KAUST Supercomputing Laboratory (KSL); Extreme Computing Research Center

Full metadata record

DC Field | Value | Language
dc.contributor.author | Dongarra, Jack | en
dc.contributor.author | Faverge, Mathieu | en
dc.contributor.author | Ltaief, Hatem | en
dc.contributor.author | Luszczek, Piotr R. | en
dc.date.accessioned | 2015-08-24T08:33:24Z | en
dc.date.available | 2015-08-24T08:33:24Z | en
dc.date.issued | 2013-09-18 | en
dc.identifier.issn | 15320626 | en
dc.identifier.doi | 10.1002/cpe.3110 | en
dc.identifier.uri | http://hdl.handle.net/10754/575581 | en
dc.description.abstract | The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is a characteristic of many dense linear algebra computations. For example, it has become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, collected by the TOP500 website. Multicore processors continue to present challenges to the development of fast and robust numerical software due to the increasing levels of hardware parallelism and widening gap between core and memory speeds. In this context, the difficulty in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance while maintaining the accuracy of the numerical algorithm. This paper proposes a new approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. Our approach uses a parallel fine-grained recursive formulation of the panel factorization step and implements the update of the trailing submatrix with the tile algorithm. Based on conflict-free partitioning of the data and lockless synchronization mechanisms, our implementation lets the overall computation flow naturally without contention. The dynamic runtime system called QUARK is then able to schedule tasks with heterogeneous granularities and to transparently introduce algorithmic lookahead. The performance results of our implementation are competitive compared to the currently available software packages and libraries. For example, it is up to 40% faster when compared to the equivalent Intel MKL routine and up to threefold faster than LAPACK with multithreaded Intel MKL BLAS. Copyright © 2013 John Wiley & Sons, Ltd. | en
dc.description.sponsorship | Research reported here was partially supported by the National Science Foundation, Department of Energy, and Microsoft Research. | en
dc.publisher | Wiley-Blackwell | en
dc.subject | LU factorization | en
dc.subject | parallel linear algebra | en
dc.subject | recursion | en
dc.subject | shared memory synchronization | en
dc.subject | threaded parallelism | en
dc.title | Achieving numerical accuracy and high performance using recursive tile LU factorization with partial pivoting | en
dc.type | Article | en
dc.contributor.department | KAUST Supercomputing Laboratory (KSL) | en
dc.contributor.department | Extreme Computing Research Center | en
dc.identifier.journal | Concurrency and Computation: Practice and Experience | en
dc.contributor.institution | Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, United States | en
kaust.author | Ltaief, Hatem | en