Title: Fault tolerance in an inner-outer solver: A GVR-enabled case study

Journal Article · Lecture Notes in Computer Science
 [1];  [1];  [2]
  1. Univ. of Chicago, Chicago, IL (United States)
  2. Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, which account for much of the run time of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers, appropriate across a range of error rates. We implement them by extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable and rich error checking and recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.
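
To illustrate why a single bit flip can be either harmless or catastrophic for a solver, the following is a minimal sketch of bit-flip injection into one IEEE 754 double-precision value. It is not the fault-injection method used in the paper and does not use Trilinos or GVR; the function name `flip_bit` and the bit-numbering convention are assumptions made for this example only.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE 754 binary64 encoding flipped.

    Hypothetical helper for illustration. Bit 0 is the least significant
    mantissa bit, bits 52-62 are the exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    # Toggle the chosen bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped
```

Under this convention, `flip_bit(1.0, 0)` perturbs the value by about 2.2e-16, `flip_bit(1.0, 52)` halves it to 0.5, `flip_bit(1.0, 63)` negates it, and `flip_bit(1.0, 62)` turns it into infinity. The same kind of error can therefore be absorbed by an iterative method, slow its convergence, or break it outright, depending on which bit is hit.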

Research Organization:
Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Sponsoring Organization:
USDOE National Nuclear Security Administration (NNSA)
Grant/Contract Number:
AC04-94AL85000
OSTI ID:
1237365
Report Number(s):
SAND-2015-0174J; 562108
Journal Information:
Lecture Notes in Computer Science, Vol. 8969; ISSN 0302-9743
Publisher:
Springer
Country of Publication:
United States
Language:
English
Citation Metrics:
Cited by: 3 works
Citation information provided by
Web of Science
