ABSTRACT
Fault tolerance has always been an important topic for running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units, and the larger a job is, the more computing hours a crash wastes. In this paper, we describe the work done in our MPI runtime to enable a transparent checkpointing mechanism. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart (C/R) and leaves aside broader features such as resiliency. We show how existing transparent checkpointing methods can be practically applied to MPI implementations given sufficient collaboration from the MPI runtime. We then measure our C/R technique on MPI benchmarks such as IMB and LULESH over an InfiniBand high-speed network, demonstrating that the chosen approach is sufficiently general and that performance is mostly preserved. We argue that enabling fault tolerance without any modification to target MPI applications is possible, and show how it can be a first step toward more integrated resiliency combined with failure mitigation such as ULFM.
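The transparent approach described above typically rests on a blocking coordinated checkpoint: all ranks synchronize so that no message is in flight, each rank serializes its local state, and a restart reloads every snapshot. The sketch below is a generic illustration of that protocol only, not the runtime's actual implementation; ranks are simulated with threads, and `NUM_RANKS`, the file layout, and the state contents are invented for the demo.

```python
# Minimal sketch of a blocking coordinated checkpoint: every rank
# reaches a barrier (so no message is in flight), snapshots its local
# state, then resumes once all snapshots are on disk.
import os
import pickle
import tempfile
from threading import Barrier, Thread

NUM_RANKS = 4  # hypothetical job size for the demo

def checkpoint(rank, state, barrier, ckpt_dir):
    barrier.wait()  # quiesce: all ranks stop communicating
    with open(os.path.join(ckpt_dir, f"rank{rank}.ckpt"), "wb") as f:
        pickle.dump(state, f)  # snapshot this rank's local state
    barrier.wait()  # resume only once every rank has written its snapshot

def restart(rank, ckpt_dir):
    with open(os.path.join(ckpt_dir, f"rank{rank}.ckpt"), "rb") as f:
        return pickle.load(f)

ckpt_dir = tempfile.mkdtemp()
barrier = Barrier(NUM_RANKS)
states = [{"rank": r, "iteration": 100 + r} for r in range(NUM_RANKS)]
threads = [Thread(target=checkpoint, args=(r, states[r], barrier, ckpt_dir))
           for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# After a simulated crash, every rank reloads its own snapshot.
recovered = [restart(r, ckpt_dir) for r in range(NUM_RANKS)]
print(recovered == states)  # → True
```

The two barriers are the essence of the blocking scheme: the first guarantees a globally consistent cut with no in-flight traffic, the second prevents any rank from resuming before the checkpoint set is complete.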
Checkpoint/restart approaches for a thread-based MPI runtime

HIGHLIGHTS
- Transparent checkpoint/restart can be applied to high-speed networks with collaboration from the MPI runtime (particularly network modularity).