
User-level checkpoint and recovery for LAM/MPI

Published: 01 July 2005

Abstract

As high-performance clusters continue to grow in size and popularity, fault tolerance and reliability are becoming limiting factors for application scalability. We integrated a user-level checkpointing and rollback recovery (CRR) library into LAM/MPI, a high-performance implementation of the Message Passing Interface (MPI), to improve its availability. Compared with the existing CRR implementation in LAM/MPI, our work supports file checkpointing and is more portable: it runs on more platforms, including IA32 and IA64 Linux. In addition, our tests show that the CRR mechanism in our implementation introduces less than 15% performance overhead.


Published In

ACM SIGOPS Operating Systems Review, Volume 39, Issue 3
July 2005
93 pages
ISSN:0163-5980
DOI:10.1145/1075395

Publisher

Association for Computing Machinery

New York, NY, United States
