research-article

CARE: compiler-assisted recovery from soft failures

Authors:
Chao Chen, Georgia Institute of Technology
Greg Eisenhauer, Georgia Institute of Technology
Santosh Pande, Georgia Institute of Technology
Qiang Guan, Kent State University

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2019, Article No.: 58, Pages 1–23. https://doi.org/10.1145/3295500.3356194

Published: 17 November 2019

ABSTRACT

As processors continue to boost system performance through higher circuit density, shrinking process technology, and near-threshold voltage (NTV) operation, they are projected to become more vulnerable to transient faults, which have become a major concern for future extreme-scale HPC systems. Although relatively infrequent, crashes due to transient faults are highly disruptive, particularly for massively parallel jobs on supercomputers, where a single crash can kill the entire job and force an expensive rerun or a restart from a checkpoint.

In this paper, we present CARE, a lightweight compiler-assisted technique that repairs a crashed process on the fly when a crash-causing error is detected, allowing the application to continue executing instead of being terminated and restarted. Specifically, CARE seeks to repair failures that would crash the application through invalid memory references (segmentation violations). During compilation, CARE constructs a recovery kernel for each crash-prone instruction; when an error occurs, CARE attempts to repair the corrupted process state by executing the corresponding recovery kernel to recompute the memory reference on the fly. We evaluated CARE with four scientific workloads. During normal execution, CARE incurs almost zero runtime overhead and a fixed 27 MB memory overhead. Meanwhile, CARE recovers on average 83.54% of crash-causing errors within dozens of milliseconds. We also evaluated CARE with parallel jobs running on 3072 cores and showed that it can successfully mask the impact of crash-causing errors by providing almost uninterrupted execution. Finally, we present a preliminary evaluation on BLAS, which shows that CARE can recover failures in libraries with a high coverage rate of 83% and negligible overhead. With such an effective recovery mechanism, CARE could substantially reduce the overheads and resource requirements of the resilience subsystem in future extreme-scale systems.
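The recovery-kernel idea in the abstract can be illustrated with a small conceptual model. The sketch below is not CARE's implementation: it assumes a hand-written kernel table in place of compiler output, uses Python's `IndexError` as a stand-in for a segmentation violation, and assumes the inputs to the kernel (`i`, `stride`, `offset`) are themselves uncorrupted. The names `RECOVERY_KERNELS`, `guarded_load`, and `flip_bit` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

# A recovery kernel recomputes an index from the live inputs of one access.
RecoveryKernel = Callable[[Dict[str, int]], int]

# Hypothetical table with one kernel per crash-prone access site. In the paper
# such kernels are constructed at compile time; here one is written by hand.
RECOVERY_KERNELS: Dict[str, RecoveryKernel] = {
    "load_a": lambda live: live["i"] * live["stride"] + live["offset"],
}


def flip_bit(value: int, bit: int) -> int:
    """Simulate a transient fault by flipping one bit of an integer."""
    return value ^ (1 << bit)


def guarded_load(site: str, data: List[float], index: int,
                 live: Dict[str, int]) -> float:
    """Load data[index]; on a fault, recompute the index and retry once."""
    try:
        return data[index]
    except IndexError:  # stands in for a segmentation violation
        repaired = RECOVERY_KERNELS[site](live)
        if repaired == index:
            raise  # the kernel reproduces the faulty index: cannot recover
        return data[repaired]


data = [float(k) for k in range(16)]
live = {"i": 3, "stride": 4, "offset": 1}
good_index = live["i"] * live["stride"] + live["offset"]  # 13
bad_index = flip_bit(good_index, 40)  # corrupted index, far out of range
value = guarded_load("load_a", data, bad_index, live)
print(value)
```

In this model the fault-free path pays only for the `try` block, and the kernel runs only after a fault. A flip in a low-order bit would instead yield a valid but wrong index, which is a silent corruption rather than a crash and is outside what this sketch handles.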

Index Terms

CARE: compiler-assisted recovery from soft failures
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software fault tolerance

Published in
SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
General Chair:
Michela Taufer,
Program Chairs:
Pavan Balaji,
Antonio J. Peña
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
HPC
SDC
availability
online crash recovery
online failure recovery
reliability
resiliency
soft error
transient fault
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%