Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner
dc.contributor.author | Rizzi, F. | |
dc.contributor.author | Morris, K. | |
dc.contributor.author | Sargsyan, K. | |
dc.contributor.author | Mycek, P. | |
dc.contributor.author | Safta, C. | |
dc.contributor.author | Le Maître, O. | |
dc.contributor.author | Knio, Omar | |
dc.contributor.author | Debusschere, B.J. | |
dc.date.accessioned | 2017-06-13T07:59:52Z | |
dc.date.available | 2017-06-13T07:59:52Z | |
dc.date.issued | 2017-05-25 | |
dc.identifier.citation | Rizzi F, Morris K, Sargsyan K, Mycek P, Safta C, et al. (2017) Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner. Parallel Computing. Available: http://dx.doi.org/10.1016/j.parco.2017.05.005. | |
dc.identifier.issn | 0167-8191 | |
dc.identifier.doi | 10.1016/j.parco.2017.05.005 | |
dc.identifier.uri | http://hdl.handle.net/10754/624982 | |
dc.description.abstract | We discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDCs. The implementation is based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run on up to ∼51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults yields only a 2-percentage-point increase in the overhead needed to overcome their presence, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead. | |
dc.description.sponsorship | This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Award Number 13-016717. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. | |
dc.publisher | Elsevier BV | |
dc.relation.url | http://www.sciencedirect.com/science/article/pii/S0167819117300753 | |
dc.rights | NOTICE: this is the author’s version of a work that was accepted for publication in Parallel Computing. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Parallel Computing, 25 May 2017. DOI: 10.1016/j.parco.2017.05.005. © <year>. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ | |
dc.subject | Resiliency | |
dc.subject | Server-client programming model | |
dc.subject | Dynamic voltage/frequency scaling | |
dc.subject | PDE | |
dc.subject | Domain-decomposition | |
dc.subject | Silent data corruption | |
dc.title | Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner | |
dc.type | Article | |
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | |
dc.contributor.department | Applied Mathematics and Computational Science Program | |
dc.identifier.journal | Parallel Computing | |
dc.eprint.version | Post-print | |
dc.contributor.institution | Sandia National Labs, Livermore, CA, USA | |
dc.contributor.institution | Duke University, Durham, NC, USA | |
dc.contributor.institution | LIMSI, Orsay, France | |
kaust.person | Knio, Omar | |
refterms.dateFOA | 2019-05-25T00:00:00Z |
This item appears in the following Collection(s)
- Articles
- Applied Mathematics and Computational Science Program
  For more information visit: https://cemse.kaust.edu.sa/amcs
- Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
  For more information visit: https://cemse.kaust.edu.sa/