ABSTRACT
As processors continue to boost system performance through higher circuit density, shrinking process technology, and near-threshold voltage (NTV) operation, they are projected to become more vulnerable to transient faults, which are one of the major concerns for future extreme-scale HPC systems. Although relatively infrequent, crashes due to transient faults are highly disruptive, particularly for massively parallel jobs on supercomputers, where a single crash can kill the entire job and force an expensive rerun or a restart from a checkpoint.
In this paper, we present CARE, a lightweight compiler-assisted technique that repairs a crashed process on the fly when a crash-causing error is detected, allowing applications to continue executing instead of being terminated and restarted. Specifically, CARE seeks to repair failures that would otherwise crash the application because of invalid memory references (segmentation violations). During compilation, CARE constructs a recovery kernel for each crash-prone instruction. When an error occurs, CARE attempts to repair the corrupted state of the process by executing the corresponding recovery kernel to recompute the memory reference on the fly. We evaluated CARE with four scientific workloads. During normal execution, CARE incurs almost zero runtime overhead and a fixed 27 MB memory overhead. CARE recovers, on average, 83.54% of crash-causing errors within dozens of milliseconds. We also evaluated CARE with parallel jobs running on 3072 cores and showed that it can successfully mask the impact of crash-causing errors by providing almost uninterrupted execution. Finally, we present a preliminary evaluation with BLAS, which shows that CARE can recover from failures in libraries with a high coverage rate of 83% and negligible overhead. With such an effective recovery mechanism, CARE could substantially reduce the overhead and resource requirements of the resilience subsystem in future extreme-scale systems.
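The recovery idea described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not CARE's implementation: memory is modeled as a Python list, an out-of-range index stands in for a segmentation violation, a transient fault is modeled as a single bit flip in a computed address, and the names `load`, `recovery_kernel`, `guarded_load`, and `flip_bit` are hypothetical. It also assumes that the inputs to the address computation (`base`, `index`, `stride`) are still intact when recovery runs.

```python
# Illustrative simulation only; all names here are hypothetical, not CARE's API.
MEMORY = list(range(100, 116))  # 16 valid "words" of simulated memory


def load(address):
    """Simulated load: an out-of-range address stands in for a segmentation violation."""
    if not 0 <= address < len(MEMORY):
        raise MemoryError(f"invalid address {address}")
    return MEMORY[address]


def recovery_kernel(base, index, stride):
    """Hypothetical recovery kernel: recompute the address from its (assumed intact) inputs."""
    return base + index * stride


def guarded_load(address, base, index, stride):
    """Try the load; on a simulated crash, recompute the address and retry once.

    Returns (value, recovered), where recovered tells whether repair was needed.
    """
    try:
        return load(address), False
    except MemoryError:
        repaired = recovery_kernel(base, index, stride)
        return load(repaired), True


def flip_bit(value, bit):
    """Model a transient fault as a single bit flip in an integer."""
    return value ^ (1 << bit)


base, index, stride = 2, 3, 4
address = base + index * stride    # correct address: 14
corrupted = flip_bit(address, 20)  # lands far outside the simulated memory
value, recovered = guarded_load(corrupted, base, index, stride)
print(value, recovered)            # prints "114 True"
```

In this sketch the fault-free path pays only for the `try` block, loosely mirroring the claim that normal execution sees almost no overhead. Recovery fails if the inputs themselves are corrupted, which is one plausible reason a real system would not reach full coverage.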
Index Terms
- CARE: compiler-assisted recovery from soft failures