Title: Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications

Abstract

Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault-injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. Both time and resource requirements greatly limit the scope of the resilience studies that can currently be performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces the number and the scale of the fault-injection experiments to be performed and provides a validated way to study application fault behavior at large scale. We show that it can accurately model application fault behavior at large scale using only small-scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.

Authors:
Kestor Gioiosa, Gokcen [1]; Peng, Ivy Bo [1]; Gioiosa, Roberto [1]; Krishnamoorthy, Sriram [2]
  1. Oak Ridge National Lab. (ORNL)
  2. Pacific Northwest National Laboratory (PNNL)
Publication Date:
2018-05-01
Research Org.:
Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
Sponsoring Org.:
USDOE
OSTI Identifier:
1474572
DOE Contract Number:  
AC05-00OR22725
Resource Type:
Conference
Resource Relation:
Conference: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, D.C., United States, May 1-4, 2018
Country of Publication:
United States
Language:
English

Citation Formats

Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, and Krishnamoorthy, Sriram. Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications. United States: N. p., 2018. Web. doi:10.1109/CCGRID.2018.00075.
Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, & Krishnamoorthy, Sriram. Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications. United States. https://doi.org/10.1109/CCGRID.2018.00075
Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, and Krishnamoorthy, Sriram. 2018. "Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications". United States. https://doi.org/10.1109/CCGRID.2018.00075. https://www.osti.gov/servlets/purl/1474572.
@article{osti_1474572,
title = {Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications},
author = {Kestor Gioiosa, Gokcen and Peng, Ivy Bo and Gioiosa, Roberto and Krishnamoorthy, Sriram},
abstractNote = {Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. Both time and resource requirements greatly limit the scope of the resilience studies that can be currently performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces the set and the scale of the fault injection experiments to be performed and provides a validated methodology to study application fault behavior at large scale. We show that our methodology can accurately model application fault behavior at large scale by using only small scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.},
doi = {10.1109/CCGRID.2018.00075},
url = {https://www.osti.gov/biblio/1474572},
place = {United States},
year = {2018},
month = may
}
