Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner

Handle URI:
http://hdl.handle.net/10754/624982
Title:
Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner
Authors:
Rizzi, F.; Morris, K.; Sargsyan, K.; Mycek, P.; Safta, C.; Le Maître, O.; Knio, Omar; Debusschere, B.J.
Abstract:
We discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDCs. The implementation is based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run on up to ∼51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults yields only a 2-percentage-point change in the overhead needed to overcome their presence, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead.
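For readers unfamiliar with the fault model named in the abstract, the following is a minimal sketch of what a random single or double bit-flip does to one floating-point value. It is not the authors' injection code: the function names `flip_bits` and `inject_random_flip` are hypothetical, and the sketch assumes data stored as IEEE-754 64-bit doubles with every bit equally likely to be hit.

```python
import random
import struct


def flip_bits(value, positions):
    """Return `value` with the given bit positions inverted.

    Assumes an IEEE-754 64-bit double; position 0 is the least
    significant mantissa bit and position 63 is the sign bit.
    """
    # Reinterpret the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    for pos in positions:
        bits ^= 1 << pos
    # Reinterpret the modified integer as a double again.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
    return corrupted


def inject_random_flip(value, n_flips=1, rng=random):
    """Flip `n_flips` distinct, uniformly chosen bits of `value`.

    n_flips=1 mimics a single bit-flip, n_flips=2 a double bit-flip.
    Returns the corrupted value and the positions that were flipped.
    """
    positions = rng.sample(range(64), n_flips)
    return flip_bits(value, positions), positions


if __name__ == "__main__":
    print(flip_bits(1.0, [0]))   # lowest mantissa bit: tiny perturbation
    print(flip_bits(1.0, [62]))  # highest exponent bit: value becomes inf
    print(inject_random_flip(3.5, n_flips=2, rng=random.Random(0)))
```

The impact of a flip depends strongly on which bit is hit: a low mantissa bit perturbs the value by roughly machine precision, while an exponent bit can change it by many orders of magnitude. This is why such corruptions are hard to detect by inspection alone.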
KAUST Department:
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Citation:
Rizzi F, Morris K, Sargsyan K, Mycek P, Safta C, et al. (2017) Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner. Parallel Computing. Available: http://dx.doi.org/10.1016/j.parco.2017.05.005.
Publisher:
Elsevier BV
Journal:
Parallel Computing
Issue Date:
25-May-2017
DOI:
10.1016/j.parco.2017.05.005
Type:
Article
ISSN:
0167-8191
Sponsors:
This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 13-016717. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
Additional Links:
http://www.sciencedirect.com/science/article/pii/S0167819117300753
Appears in Collections:
Articles

Full metadata record

DC Field | Value | Language
dc.contributor.author | Rizzi, F. | en
dc.contributor.author | Morris, K. | en
dc.contributor.author | Sargsyan, K. | en
dc.contributor.author | Mycek, P. | en
dc.contributor.author | Safta, C. | en
dc.contributor.author | Le Maître, O. | en
dc.contributor.author | Knio, Omar | en
dc.contributor.author | Debusschere, B.J. | en
dc.date.accessioned | 2017-06-13T07:59:52Z | -
dc.date.available | 2017-06-13T07:59:52Z | -
dc.date.issued | 2017-05-25 | en
dc.identifier.citation | Rizzi F, Morris K, Sargsyan K, Mycek P, Safta C, et al. (2017) Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner. Parallel Computing. Available: http://dx.doi.org/10.1016/j.parco.2017.05.005. | en
dc.identifier.issn | 0167-8191 | en
dc.identifier.doi | 10.1016/j.parco.2017.05.005 | en
dc.identifier.uri | http://hdl.handle.net/10754/624982 | -
dc.description.abstract | We discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDCs. The implementation is based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run on up to ∼51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults yields only a 2-percentage-point change in the overhead needed to overcome their presence, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead. | en
dc.description.sponsorship | This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 13-016717. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. | en
dc.publisher | Elsevier BV | en
dc.relation.url | http://www.sciencedirect.com/science/article/pii/S0167819117300753 | en
dc.rights | NOTICE: this is the author’s version of a work that was accepted for publication in Parallel Computing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Parallel Computing, 25 May 2017. DOI: 10.1016/j.parco.2017.05.005. © <year>. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ | en
dc.subject | Resiliency | en
dc.subject | Server-client programming model | en
dc.subject | Dynamic voltage/frequency scaling | en
dc.subject | PDE | en
dc.subject | Domain-decomposition | en
dc.subject | Silent data corruption | en
dc.title | Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner | en
dc.type | Article | en
dc.contributor.department | King Abdullah University of Science and Technology, Thuwal, Saudi Arabia | en
dc.identifier.journal | Parallel Computing | en
dc.eprint.version | Post-print | en
dc.contributor.institution | Sandia National Labs, Livermore, CA, USA | en
dc.contributor.institution | Duke University, Durham, NC, USA | en
dc.contributor.institution | LIMSI, Orsay, France | en
kaust.author | Knio, Omar | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.