DOI: 10.1145/3236367.3236383
research-article

Transparent High-Speed Network Checkpoint/Restart in MPI

Published: 23 September 2018

ABSTRACT

Fault tolerance has always been an important concern when running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems that gather millions of computing units. Moreover, the larger a job is, the more computing hours a crash wastes. In this paper, we describe the work done in our MPI runtime to enable a transparent checkpointing mechanism. Unlike the User-Level Failure Mitigation (ULFM) interface proposed for MPI 4.0, our work targets solely Checkpoint/Restart (C/R) and ignores wider features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given sufficient collaboration from the MPI runtime. Our C/R technique is then measured on MPI benchmarks such as IMB and LULESH running over an InfiniBand high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault tolerance without any modification to the target MPI applications is possible, and show how it could be a first step toward more integrated resiliency combined with failure mitigation mechanisms such as ULFM.

  • Published in

    EuroMPI '18: Proceedings of the 25th European MPI Users' Group Meeting
    September 2018
    187 pages
    ISBN: 9781450364928
    DOI: 10.1145/3236367

    Copyright © 2018 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 66 of 139 submissions, 47%
