LADR: low-cost application-level detector for reducing silent output corruptions
- Georgia Institute of Technology
- Georgia Institute of Technology, Atlanta
- ORNL
Applications running on future high performance computing (HPC) systems are more likely to experience transient faults because of technology scaling trends: higher circuit density, smaller transistor sizes, and near-threshold voltage (NTV) operation. A transient fault can corrupt application state without warning, possibly leading to incorrect application output. Such errors are called silent data corruptions (SDCs). In this paper, we present LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables (those of scientific interest). It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false positive rates. We evaluated LADR with four scientific workloads, and the results show that LADR achieved > 80% fault coverage with only about 3% runtime overhead and about 1% memory overhead. Compared with prior state-of-the-art anomaly-based detection methods, LADR achieved comparable or improved fault coverage while reducing runtime overheads by 21% to 75% and memory overheads by 35% to 55% for the evaluated workloads. We believe that low memory and runtime overheads, coupled with attractive detection precision, make LADR a viable approach for assuring correct output from large-scale high performance simulations.
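As a rough illustration of what "watching for data anomalies" in a state variable can mean, here is a minimal Python sketch of a temporal anomaly check. It is not LADR's detector, which this record does not describe in detail. The class name `TemporalAnomalyDetector`, the linear-extrapolation predictor, and the `margin` and `floor` parameters are all assumptions chosen for illustration. The sketch predicts each element of a monitored array from its two previous time steps and flags elements whose prediction error is far larger than any error seen on accepted steps.

```python
import numpy as np


class TemporalAnomalyDetector:
    """Hypothetical detector: flags elements that deviate strongly from a
    linear extrapolation of the two most recent accepted snapshots."""

    def __init__(self, margin=2.0, floor=1e-12):
        self.margin = margin    # assumed safety factor on the learned error bound
        self.floor = floor      # assumed absolute slack for near-constant data
        self.history = []       # up to two most recent accepted snapshots
        self.max_err = 0.0      # largest prediction error on accepted steps
        self.trained = False    # True once at least one error has been learned

    def check(self, snapshot):
        """Return a boolean mask marking suspicious elements of `snapshot`."""
        snapshot = np.asarray(snapshot, dtype=float)
        flags = np.zeros(snapshot.shape, dtype=bool)
        if len(self.history) == 2:
            prev2, prev1 = self.history
            err = np.abs(snapshot - (2.0 * prev1 - prev2))
            if self.trained:
                flags = err > self.margin * self.max_err + self.floor
            if not flags.any():
                # Learn the error bound only from snapshots that were accepted.
                self.max_err = max(self.max_err, float(err.max()))
                self.trained = True
        if not flags.any():
            # A rejected snapshot is not added to the history; the caller is
            # expected to recompute or roll back that step.
            self.history = (self.history + [snapshot.copy()])[-2:]
        return flags


def run_demo(steps=60, corrupt_step=50, corrupt_index=3):
    """Evolve a smooth synthetic state and inject one corruption.

    Returns (step, flagged_indices) for the first flagged step, or None.
    """
    grid = np.linspace(0.0, 1.0, 8)
    detector = TemporalAnomalyDetector()
    for t in range(steps):
        state = np.sin(0.01 * t + grid)       # stand-in for a simulation variable
        if t == corrupt_step:
            state[corrupt_index] += 1.0       # stand-in for a silent corruption
        flags = detector.check(state)
        if flags.any():
            return t, np.flatnonzero(flags).tolist()
    return None


if __name__ == "__main__":
    print(run_demo())
```

In this sketch, every monitored array costs two extra snapshots of memory plus one pass over the data per time step. That cost is one way to see why reducing the number of monitored variables, as the abstract describes, lowers both runtime and memory overheads.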
- Research Organization:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Organization:
- USDOE
- DOE Contract Number:
- AC05-00OR22725
- OSTI ID:
- 1468063
- Resource Relation:
- Conference: International Symposium on High-Performance Parallel and Distributed Computing, New York, New York, June 11-15, 2018
- Country of Publication:
- United States
- Language:
- English