A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction

Handle URI:
http://hdl.handle.net/10754/575805
Title:
A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction
Authors:
Haidar, Azzam; Ltaief, Hatem (ORCID: 0000-0002-6897-1095); Luszczek, Piotr R.; Dongarra, Jack
Abstract:
We present new high-performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing the overheads associated with their high-level scheduling during the so-called bulge chasing procedure, which is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve the numerical correctness of the computation. We show our annihilation strategy, based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound, which renders them amenable to efficient execution on multicore architectures. The performance of the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvements (∼130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight-socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000. Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms of cache utilization and power consumption in the context of this new two-stage bidiagonal reduction. © 2012 IEEE.
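Illustrative sketch: the abstract describes coalescing several fine-grained bulge-chasing kernels into a single scheduled unit so that the runtime scheduler is entered once per group rather than once per kernel, while the touched portion of the band stays in cache across the fused kernels. The short C program below is a minimal, hypothetical sketch of that scheduling pattern only; the names (chase_one_bulge, coalesced_task, GROUP_SIZE) are invented for illustration and are not taken from the paper or its implementation.

/*
 * Minimal sketch of task coalescing for a bulge-chasing-style sweep.
 * NOT the authors' code: the kernel body is a placeholder, and all
 * names are hypothetical.
 */
#include <stdio.h>

#define N          24   /* (tiny) band length used for this demo          */
#define GROUP_SIZE  4   /* number of fine-grained kernels fused per task  */

/* One fine-grained kernel: placeholder for annihilating a single bulge. */
static void chase_one_bulge(double *band, int i)
{
    band[i] *= 0.5;     /* stands in for applying an orthogonal reflector */
}

/* A coalesced task: runs GROUP_SIZE consecutive kernels back to back,   */
/* so the scheduler is entered once per group instead of once per kernel */
/* and the touched band entries stay in cache across the fused kernels.  */
static void coalesced_task(double *band, int first, int last)
{
    for (int i = first; i < last; ++i)
        chase_one_bulge(band, i);
}

int main(void)
{
    double band[N];
    for (int i = 0; i < N; ++i)
        band[i] = 1.0;

    /* "Scheduler" loop: one dispatch per coalesced group. */
    for (int first = 0; first < N; first += GROUP_SIZE) {
        int last = (first + GROUP_SIZE < N) ? first + GROUP_SIZE : N;
        coalesced_task(band, first, last);   /* one context switch */
    }

    printf("band[0] after chasing: %.3f\n", band[0]);
    return 0;
}

In a real dynamic-scheduling runtime, the data dependencies between coalescing groups would still be declared explicitly, which is the condition the abstract identifies as necessary for preserving numerical correctness.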
KAUST Department:
KAUST Supercomputing Laboratory (KSL); Extreme Computing Research Center
Publisher:
Institute of Electrical and Electronics Engineers (IEEE)
Journal:
2012 IEEE 26th International Parallel and Distributed Processing Symposium
Conference/Event name:
2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
Issue Date:
May-2012
DOI:
10.1109/IPDPS.2012.13
Type:
Conference Paper
ISBN:
9780769546759
Appears in Collections:
Conference Papers; KAUST Supercomputing Laboratory (KSL); Extreme Computing Research Center

Full metadata record

DC Field | Value | Language
dc.contributor.author | Haidar, Azzam | en
dc.contributor.author | Ltaief, Hatem | en
dc.contributor.author | Luszczek, Piotr R. | en
dc.contributor.author | Dongarra, Jack | en
dc.date.accessioned | 2015-08-24T09:26:41Z | en
dc.date.available | 2015-08-24T09:26:41Z | en
dc.date.issued | 2012-05 | en
dc.identifier.isbn | 9780769546759 | en
dc.identifier.doi | 10.1109/IPDPS.2012.13 | en
dc.identifier.uri | http://hdl.handle.net/10754/575805 | en
dc.description.abstract | We present new high-performance numerical kernels combined with advanced optimization techniques that significantly increase the performance of parallel bidiagonal reduction. Our approach is based on developing efficient fine-grained computational tasks as well as reducing the overheads associated with their high-level scheduling during the so-called bulge chasing procedure, which is an essential phase of a scalable bidiagonalization procedure. In essence, we coalesce multiple tasks in a way that reduces the time needed to switch execution context between the scheduler and useful computational tasks. At the same time, we maintain the crucial information about the tasks and their data dependencies between the coalescing groups. This is the necessary condition to preserve the numerical correctness of the computation. We show our annihilation strategy, based on multiple applications of single orthogonal reflectors. Despite non-trivial characteristics in computational complexity and memory access patterns, our optimization approach smoothly applies to the annihilation scenario. The coalescing positively influences another equally important aspect of the bulge chasing stage: memory reuse. For the tasks within the coalescing groups, the data is retained in high levels of the cache hierarchy and, as a consequence, operations that are normally memory-bound increase their ratio of computation to off-chip communication and become compute-bound, which renders them amenable to efficient execution on multicore architectures. The performance of the new two-stage bidiagonal reduction is staggering. Our implementation results in up to 50-fold and 12-fold improvements (∼130 Gflop/s) compared to the equivalent routines from LAPACK V3.2 and Intel MKL V10.3, respectively, on an eight-socket hexa-core AMD Opteron multicore shared-memory system with a matrix size of 24000 × 24000. Last but not least, we provide a comprehensive study on the impact of the coalescing group size in terms of cache utilization and power consumption in the context of this new two-stage bidiagonal reduction. © 2012 IEEE. | en
dc.publisher | Institute of Electrical and Electronics Engineers (IEEE) | en
dc.subject | Bidiagonal Reduction | en
dc.subject | Bulge Chasing | en
dc.subject | Dynamic Scheduling | en
dc.subject | Granularity Analysis | en
dc.subject | Power Profiling | en
dc.subject | Tile Algorithms | en
dc.subject | Two-Stage Approach | en
dc.title | A comprehensive study of task coalescing for selecting parallelism granularity in a two-stage bidiagonal reduction | en
dc.type | Conference Paper | en
dc.contributor.department | KAUST Supercomputing Laboratory (KSL) | en
dc.contributor.department | Extreme Computing Research Center | en
dc.identifier.journal | 2012 IEEE 26th International Parallel and Distributed Processing Symposium | en
dc.conference.date | 21 May 2012 through 25 May 2012 | en
dc.conference.name | 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012 | en
dc.conference.location | Shanghai | en
dc.contributor.institution | Innovative Computing Laboratory, University of Tennessee, Knoxville, TN 37996, United States | en
kaust.author | Ltaief, Hatem | en