ABSTRACT
Fault tolerance has always been an important topic for running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units, and the larger a job is, the more computing hours a crash wastes. In this paper, we describe the work done in our MPI runtime to enable a transparent checkpointing mechanism. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and leaves aside broader features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given sufficient collaboration from the MPI runtime. We then measure our C/R technique on MPI benchmarks such as IMB and LULESH over an InfiniBand high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault tolerance without any modification to target MPI applications is possible, and show how it can be a first step toward more integrated resiliency combined with failure mitigation such as ULFM.
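The transparent approach described above typically rests on a blocking coordinated checkpoint: all ranks synchronize so that no message is in flight, each rank serializes its local state, and a restart reloads every snapshot. The sketch below is a generic illustration of that protocol only, not the runtime's actual implementation; ranks are simulated with threads, and `NUM_RANKS`, the file layout, and the state contents are invented for the demo.

```python
# Minimal sketch of a blocking coordinated checkpoint: every rank
# reaches a barrier (so no message is in flight), snapshots its local
# state, then resumes once all snapshots are on disk.
import os
import pickle
import tempfile
from threading import Barrier, Thread

NUM_RANKS = 4  # hypothetical job size for the demo

def checkpoint(rank, state, barrier, ckpt_dir):
    barrier.wait()  # quiesce: all ranks stop communicating
    with open(os.path.join(ckpt_dir, f"rank{rank}.ckpt"), "wb") as f:
        pickle.dump(state, f)  # snapshot this rank's local state
    barrier.wait()  # resume only once every rank has written its snapshot

def restart(rank, ckpt_dir):
    with open(os.path.join(ckpt_dir, f"rank{rank}.ckpt"), "rb") as f:
        return pickle.load(f)

ckpt_dir = tempfile.mkdtemp()
barrier = Barrier(NUM_RANKS)
states = [{"rank": r, "iteration": 100 + r} for r in range(NUM_RANKS)]
threads = [Thread(target=checkpoint, args=(r, states[r], barrier, ckpt_dir))
           for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# After a simulated crash, every rank reloads its own snapshot.
recovered = [restart(r, ckpt_dir) for r in range(NUM_RANKS)]
print(recovered == states)  # → True
```

The two barriers are the essence of the blocking scheme: the first guarantees a globally consistent cut with no in-flight traffic, the second prevents any rank from resuming before the checkpoint set is complete.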
Checkpoint/restart approaches for a thread-based MPI runtime

HIGHLIGHTS
- Transparent checkpoint/restart can be applied to high-speed networks with collaboration from the MPI runtime (particularly network modularity).