Title: Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping

Abstract

In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.

Authors:
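Gamell, Marc [1]; Teranishi, Keita [2]; Kolla, Hemanth [2]; Mayo, Jackson [2]; Heroux, Michael A. [3]; Chen, Jacqueline [2]; Parashar, Manish [1]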
  1. Rutgers Univ., Piscataway, NJ (United States). Rutgers Discovery Informatics Inst.
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States)
  3. Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Publication Date:
2017-10-26
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States). Oak Ridge Leadership Computing Facility (OLCF); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Org.:
USDOE Office of Science (SC); USDOE National Nuclear Security Administration (NNSA)
OSTI Identifier:
1411595
Report Number(s):
SAND2017-1397J
Journal ID: ISSN 1064-8275; 651103; TRN: US1800238
Grant/Contract Number:  
AC04-94AL85000
Resource Type:
Journal Article: Accepted Manuscript
Journal Name:
SIAM Journal on Scientific Computing
Additional Journal Information:
Journal Volume: 39; Journal Issue: 5; Journal ID: ISSN 1064-8275
Publisher:
SIAM
Country of Publication:
United States
Language:
English
Subject:
97 MATHEMATICS AND COMPUTING

Citation Formats

Gamell, Marc, Teranishi, Keita, Kolla, Hemanth, Mayo, Jackson, Heroux, Michael A., Chen, Jacqueline, and Parashar, Manish. Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping. United States: N. p., 2017. Web. doi:10.1137/16m1081610.
Gamell, Marc, Teranishi, Keita, Kolla, Hemanth, Mayo, Jackson, Heroux, Michael A., Chen, Jacqueline, & Parashar, Manish. Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping. United States. https://doi.org/10.1137/16m1081610
Gamell, Marc, Teranishi, Keita, Kolla, Hemanth, Mayo, Jackson, Heroux, Michael A., Chen, Jacqueline, and Parashar, Manish. 2017. "Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping". United States. https://doi.org/10.1137/16m1081610. https://www.osti.gov/servlets/purl/1411595.
@article{osti_1411595,
title = {Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping},
author = {Gamell, Marc and Teranishi, Keita and Kolla, Hemanth and Mayo, Jackson and Heroux, Michael A. and Chen, Jacqueline and Parashar, Manish},
abstractNote = {In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.},
doi = {10.1137/16m1081610},
url = {https://www.osti.gov/biblio/1411595},
journal = {SIAM Journal on Scientific Computing},
issn = {1064-8275},
number = 5,
volume = 39,
place = {United States},
year = {2017},
month = {10}
}
