OSTI.GOV, U.S. Department of Energy
Office of Scientific and Technical Information

Title: Programmer-guided reliability for extreme-scale applications

Journal Article · International Journal of High Performance Computing Applications

We present “programmer-guided reliability” (PGR) as a systematic conceptual approach to address the expected rise in soft errors in coming extreme-scale systems at the application level. The approach involves instrumentation of the application with code to detect data corruption errors. The location and nature of these error detectors are at the discretion of the programmer, who uses their knowledge and experience with the problem domain, the application, the solution algorithms, etc., to determine the most vulnerable areas of the code and the most appropriate ways to detect data corruption. To illustrate the approach, we provide examples of error detectors from four different benchmark-scale applications. We also describe a simple control framework that allows for runtime configuration of the error detectors without recompilation of the application, as well as dynamic reconfiguration during the execution of the application. Lastly, we discuss a number of future directions building on the basic PGR approach, including the incorporation of some general error detectors into the programming environment in order to make them more easily usable by the programmer.
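The two ideas in the abstract, programmer-placed error detectors and a control layer that switches them on or off at run time, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's framework: the `DetectorRegistry` class, the detector names, and the in-memory dictionary that stands in for whatever external configuration mechanism a real implementation would read.

```python
import math
from typing import Callable, Dict, List, Sequence


class DetectorRegistry:
    """Hypothetical control layer: named detectors plus their on/off state."""

    def __init__(self) -> None:
        self._detectors: Dict[str, Callable[..., bool]] = {}
        self._enabled: Dict[str, bool] = {}
        self.reports: List[str] = []  # names of detectors that flagged an error

    def register(self, name: str, check: Callable[..., bool]) -> None:
        self._detectors[name] = check
        self._enabled[name] = False  # detectors stay off until configured

    def configure(self, settings: Dict[str, bool]) -> None:
        """Apply settings at startup or mid-run; unknown names are ignored."""
        for name, on in settings.items():
            if name in self._detectors:
                self._enabled[name] = on

    def check(self, name: str, *args) -> bool:
        """Return True if the data look valid or the detector is disabled."""
        if not self._enabled.get(name, False):
            return True
        ok = self._detectors[name](*args)
        if not ok:
            self.reports.append(name)
        return ok


# Two illustrative detectors a programmer might write from domain knowledge.
def all_finite(values: Sequence[float]) -> bool:
    return all(math.isfinite(v) for v in values)


def sum_conserved(values: Sequence[float], expected: float) -> bool:
    return math.isclose(math.fsum(values), expected, rel_tol=1e-9)


registry = DetectorRegistry()
registry.register("finite", all_finite)
registry.register("conserved", sum_conserved)
registry.configure({"finite": True})  # initial run-time configuration

state = [0.25, 0.25, 0.5]      # quantity whose sum should stay at 1.0
corrupted = [0.25, 0.25, 4.5]  # simulated corruption: finite, but sum is wrong

finite_ok = registry.check("finite", corrupted)              # generic check passes
skipped = registry.check("conserved", corrupted, 1.0)        # disabled, so skipped
registry.configure({"conserved": True})                      # reconfigure mid-run
conserved_ok = registry.check("conserved", corrupted, 1.0)   # now flags the error
```

In this sketch the generic finiteness check misses the corrupted value, while the domain-specific conservation check catches it once it is enabled, without any change to the application code itself.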

Research Organization:
Oak Ridge National Laboratory (ORNL), Oak Ridge, TN (United States)
Sponsoring Organization:
USDOE
Grant/Contract Number:
AC05-00OR22725
OSTI ID:
1464028
Journal Information:
International Journal of High Performance Computing Applications, Vol. 32, Issue 5; ISSN 1094-3420
Publisher:
SAGE
Country of Publication:
United States
Language:
English
