dc.contributor.author: Rizzi, F.
dc.contributor.author: Morris, K.
dc.contributor.author: Sargsyan, K.
dc.contributor.author: Mycek, P.
dc.contributor.author: Safta, C.
dc.contributor.author: Le Maître, O.
dc.contributor.author: Knio, Omar
dc.contributor.author: Debusschere, B.J.
dc.date.accessioned: 2017-06-13T07:59:52Z
dc.date.available: 2017-06-13T07:59:52Z
dc.date.issued: 2017-05-25
dc.identifier.citation: Rizzi F, Morris K, Sargsyan K, Mycek P, Safta C, et al. (2017) Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner. Parallel Computing. Available: http://dx.doi.org/10.1016/j.parco.2017.05.005.
dc.identifier.issn: 0167-8191
dc.identifier.doi: 10.1016/j.parco.2017.05.005
dc.identifier.uri: http://hdl.handle.net/10754/624982
dc.description.abstract: We discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDCs. The implementation is based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run on up to ∼51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults raises the overhead needed to overcome their presence by only 2 percentage points, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead.
dc.description.sponsorship: This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 13-016717. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
dc.publisher: Elsevier BV
dc.relation.url: http://www.sciencedirect.com/science/article/pii/S0167819117300753
dc.rights: NOTICE: this is the author’s version of a work that was accepted for publication in Parallel Computing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Parallel Computing, 25 May 2017. DOI: 10.1016/j.parco.2017.05.005. © <year>. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject: Resiliency
dc.subject: Server-client programming model
dc.subject: Dynamic voltage/frequency scaling
dc.subject: PDE
dc.subject: Domain-decomposition
dc.subject: Silent data corruption
dc.title: Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner
dc.type: Article
dc.contributor.department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
dc.contributor.department: Applied Mathematics and Computational Science Program
dc.identifier.journal: Parallel Computing
dc.eprint.version: Post-print
dc.contributor.institution: Sandia National Labs, Livermore, CA, USA
dc.contributor.institution: Duke University, Durham, NC, USA
dc.contributor.institution: LIMSI, Orsay, France
kaust.person: Knio, Omar
refterms.dateFOA: 2019-05-25T00:00:00Z


Files in this item

Name: main - Rizzi.pdf
Size: 861.0Kb
Format: PDF
Description: Accepted Manuscript
