Fault tolerance in an inner-outer solver: A GVR-enabled case study
- Univ. of Chicago, Chicago, IL (United States)
- Sandia National Lab. (SNL-CA), Livermore, CA (United States); Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they account for much of the run time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both the inner and outer solvers, appropriate across a range of error rates. We implement them by extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable, rich error checking and recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.
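To make the "single bit flip" fault model in the abstract concrete, the following is a minimal sketch of how one bit of a double-precision value can be flipped, and how strongly the outcome depends on which bit is hit. It is an illustration only, not the authors' injection method: the helper name `flip_bit` is hypothetical, and the sketch uses neither Trilinos nor GVR.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE-754 binary64 encoding flipped.

    Hypothetical helper for illustration. Bit 0 is the least significant
    mantissa bit, bits 52-62 are the exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR with a one-bit mask toggles exactly that bit.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


x = 1.0
mantissa_flip = flip_bit(x, 0)   # tiny relative change, about 2.2e-16
sign_flip = flip_bit(x, 63)      # same magnitude, opposite sign
exponent_flip = flip_bit(x, 62)  # top exponent bit: 1.0 becomes infinity
```

A flip in a low mantissa bit is close to rounding noise, whereas a flip in a high exponent bit can change a value by many orders of magnitude, which is one reason a single flip may either go unnoticed or derail an iterative solve.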
- Research Organization:
- Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)
- Sponsoring Organization:
- USDOE National Nuclear Security Administration (NNSA)
- Grant/Contract Number:
- AC04-94AL85000
- OSTI ID:
- 1237365
- Report Number(s):
- SAND-2015-0174J; 562108
- Journal Information:
- Lecture Notes in Computer Science, Vol. 8969; ISSN 0302-9743
- Publisher:
- Springer
- Country of Publication:
- United States
- Language:
- English