Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications
Abstract
Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. These time and resource requirements greatly limit the scope of the resilience studies that can currently be performed. In this work, we propose a methodology to model application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior using a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces both the number and the scale of the fault injection experiments to be performed and provides a validated way to study application fault behavior at large scale. We show that it can accurately model application fault behavior at large scale by using only small-scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.
- Authors:
- Kestor Gioiosa, Gokcen; Peng, Ivy Bo; Gioiosa, Roberto; Krishnamoorthy, Sriram
- ORNL
- Pacific Northwest National Laboratory (PNNL)
- Publication Date:
- May 1, 2018
- Research Org.:
- Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
- Sponsoring Org.:
- USDOE
- OSTI Identifier:
- 1474572
- DOE Contract Number:
- AC05-00OR22725
- Resource Type:
- Conference
- Resource Relation:
- Conference: 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Washington, D.C., United States, May 1-4, 2018
- Country of Publication:
- United States
- Language:
- English
Citation Formats
Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, and Krishnamoorthy, Sriram. Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications. United States: N. p., 2018. Web. doi:10.1109/CCGRID.2018.00075.
Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, & Krishnamoorthy, Sriram. Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications. United States. https://doi.org/10.1109/CCGRID.2018.00075
Kestor Gioiosa, Gokcen, Peng, Ivy Bo, Gioiosa, Roberto, and Krishnamoorthy, Sriram. 2018. "Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications". United States. https://doi.org/10.1109/CCGRID.2018.00075. https://www.osti.gov/servlets/purl/1474572.
@article{osti_1474572,
title = {Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications},
author = {Kestor Gioiosa, Gokcen and Peng, Ivy Bo and Gioiosa, Roberto and Krishnamoorthy, Sriram},
doi = {10.1109/CCGRID.2018.00075},
url = {https://www.osti.gov/biblio/1474572},
place = {United States},
year = {2018},
month = {5}
}
Works referenced in this record:
Classifying soft error vulnerabilities in extreme-scale scientific applications using a binary instrumentation tool
conference, November 2012
- Li, Dong; Vetter, Jeffrey S.; Yu, Weikuan
- 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
Matrix Multiplication on GPUs with On-Line Fault Tolerance
conference, May 2011
- Ding, Chong; Karlsson, Christer; Liu, Hui
- 2011 IEEE 9th International Symposium on Parallel and Distributed Processing with Applications (ISPA)
Shoestring: probabilistic soft error reliability on the cheap
conference, January 2010
- Feng, Shuguang; Gupta, Shantanu; Ansari, Amin
- Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems - ASPLOS '10
Localized Fault Recovery for Nested Fork-Join Programs
conference, May 2017
- Kestor, Gokcen; Krishnamoorthy, Sriram; Ma, Wenjing
- 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Radiation-induced soft errors in advanced semiconductor technologies
journal, September 2005
- Baumann, R. C.
- IEEE Transactions on Device and Materials Reliability, Vol. 5, Issue 3
An Experimental Study of Soft Errors in Microprocessors
journal, November 2005
- Saggese, G. P.; Wang, N. J.; Kalbarczyk, Z. T.
- IEEE Micro, Vol. 25, Issue 6
Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection
conference, November 2014
- Cher, Chen-Yong; Gupta, Meeta S.; Bose, Pradip
- SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
Quantitative evaluation of soft error injection techniques for robust system design
conference, January 2013
- Cho, Hyungmin; Mirkhani, Shahrzad; Cher, Chen-Yong
- Proceedings of the 50th Annual Design Automation Conference - DAC '13
BoomerAMG: A parallel algebraic multigrid solver and preconditioner
journal, April 2002
- Henson, Van Emden; Yang, Ulrike Meier
- Applied Numerical Mathematics, Vol. 41, Issue 1
Bulldozer: An Approach to Multithreaded Compute Performance
journal, March 2011
- Butler, Michael; Barnes, Leslie; Sarma, Debjit Das
- IEEE Micro, Vol. 31, Issue 2
Fast Parallel Algorithms for Short-Range Molecular Dynamics
journal, March 1995
- Plimpton, Steve
- Journal of Computational Physics, Vol. 117, Issue 1
Quantifying the Accuracy of High-Level Fault Injection Techniques for Hardware Faults
conference, June 2014
- Wei, Jiesheng; Thomas, Anna; Li, Guanpeng
- 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
Instruction-Level Impact Analysis of Low-Level Faults in a Modern Microprocessor Controller
journal, September 2011
- Maniatakos, Michail; Karimi, Naghmeh; Tirumurti, Chandra
- IEEE Transactions on Computers, Vol. 60, Issue 9
Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application
conference, May 2013
- Karlin, Ian; Bhatele, Abhinav; Keasler, Jeff
- 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS)
SDCTune: a model for predicting the SDC proneness of an application for configurable protection
conference, October 2014
- Lu, Qining; Pattabiraman, Karthik; Gupta, Meeta S.
- ESWEEK '14: Tenth Embedded System Week, Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems
Evaluating the viability of process replication reliability for exascale systems
conference, January 2011
- Ferreira, Kurt; Stearley, Jon; Laros, James H.
- Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11
A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor
conference, January 2003
- Mukherjee, S. S.; Weaver, C.; Emer, J.
- 36th International Symposium on Microarchitecture (MICRO-36)
Searching for exotic particles in high-energy physics with deep learning
journal, July 2014
- Baldi, P.; Sadowski, P.; Whiteson, D.
- Nature Communications, Vol. 5, Issue 1
The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing
journal, November 2005
- Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian
- The International Journal of High Performance Computing Applications, Vol. 19, Issue 4
Soft error vulnerability of iterative linear algebra methods
conference, January 2008
- Bronevetsky, Greg; de Supinski, Bronis
- Proceedings of the 22nd annual international conference on Supercomputing - ICS '08
Fault resilience of the algebraic multi-grid solver
conference, January 2012
- Casas, Marc; de Supinski, Bronis R.; Bronevetsky, Greg
- Proceedings of the 26th ACM international conference on Supercomputing - ICS '12
Quantitatively Modeling Application Resilience with the Data Vulnerability Factor
conference, November 2014
- Yu, Li; Li, Dong; Mittal, Sparsh
- SC14: International Conference for High Performance Computing, Networking, Storage and Analysis
Fault Modeling of Extreme Scale Applications Using Machine Learning
conference, May 2016
- Vishnu, Abhinav; van Dam, Hubertus; Tallent, Nathan R.
- 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)