DOI: 10.1145/3295500.3356194

CARE: compiler-assisted recovery from soft failures

Published: 17 November 2019

ABSTRACT

As processors continue to boost system performance through higher circuit density, shrinking process technology, and near-threshold voltage (NTV) operation, they are projected to become more vulnerable to transient faults, which have become one of the major concerns for future extreme-scale HPC systems. Despite being relatively infrequent, crashes due to transient faults are highly disruptive, particularly for massively parallel jobs on supercomputers, where a single fault can kill the entire job and force an expensive rerun or restart from a checkpoint.

In this paper, we present CARE, a lightweight compiler-assisted technique that repairs a crashed process on the fly when a crash-causing error is detected, allowing applications to continue executing instead of being terminated and restarted. Specifically, CARE targets failures that would crash an application through invalid memory references (segmentation violations). During compilation, CARE constructs a recovery kernel for each crash-prone instruction; upon the occurrence of an error, CARE attempts to repair the corrupted process state by executing the corresponding recovery kernel to recompute the memory reference on the fly. We evaluated CARE with four scientific workloads. During their normal execution, CARE incurs almost zero runtime overhead and a fixed 27 MB memory overhead, while recovering, on average, 83.54% of crash-causing errors within dozens of milliseconds. We also evaluated CARE with parallel jobs running on 3,072 cores and showed that CARE can successfully mask the impact of crash-causing errors, providing nearly uninterrupted execution. Finally, we present preliminary evaluation results for BLAS, which show that CARE can recover failures in libraries with a high coverage rate of 83% and negligible overheads. With such an effective recovery mechanism, CARE could substantially reduce the overheads and resource requirements of the resilience subsystem in future extreme-scale systems.


      • Published in

        SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
        November 2019, 1921 pages
        ISBN: 9781450362290
        DOI: 10.1145/3295500

        Copyright © 2019 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher: Association for Computing Machinery, New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
