DOI: 10.1145/1188455.1188587
Article

Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI

Published: 11 November 2006

Abstract

A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches have been proposed, using a variety of fault tolerant message passing protocols based on coordinated checkpointing or message logging. The most popular approach is coordinated checkpointing. In the literature, two different concepts of coordinated checkpointing have been proposed: blocking and non-blocking. However, they have never been compared quantitatively, and their respective scalability remains unknown. The contribution of this paper is to provide the first comparison between these two approaches and a study of their scalability. We have implemented the two approaches within the MPICH environments and evaluated their performance using the NAS Parallel Benchmarks.
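
The link between node count and failure probability in the abstract's first sentence can be illustrated with a simple model. This is only a sketch under assumptions that are not taken from the paper: each of the n nodes fails independently during a run with the same probability p, and the failure of any single node interrupts the application.

```latex
% Assumed model (not from the paper): n independent nodes,
% each failing with probability p during one run.
\[
  P_{\text{fail}}(n) = 1 - (1 - p)^{n}
\]
% Illustrative values with p = 10^{-3}:
\[
  P_{\text{fail}}(100) \approx 0.095,
  \qquad
  P_{\text{fail}}(1000) \approx 0.63
\]
```

Under these assumptions, a tenfold increase in the number of nodes raises the chance of at least one failure per run from under 10% to over 60%, which is why fault tolerance becomes more pressing as platforms grow.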

Published In

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing
November 2006
746 pages
ISBN:0769527000
DOI:10.1145/1188455
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '06
Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Cited By

  • (2022) Targeting a light-weight and multi-channel approach for distributed stream processing. Journal of Parallel and Distributed Computing 167:C (77-96). DOI: 10.1016/j.jpdc.2022.04.022. Online publication date: 1-Sep-2022
  • (2019) Transitioning scientific applications to using non-volatile memory for resilience. Proceedings of the International Symposium on Memory Systems (114-125). DOI: 10.1145/3357526.3357563. Online publication date: 30-Sep-2019
  • (2019) GPU snapshot. Proceedings of the ACM International Conference on Supercomputing (171-183). DOI: 10.1145/3330345.3330361. Online publication date: 26-Jun-2019
  • (2018) Efficient Execution of Smart City’s Assets Through a Massive Parallel Computational Model. Smart Societies, Infrastructure, Technologies and Applications (44-51). DOI: 10.1007/978-3-319-94180-6_6. Online publication date: 22-Jul-2018
  • (2018) ER einit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications. Concurrency and Computation: Practice and Experience 32:3. DOI: 10.1002/cpe.4863. Online publication date: 14-Aug-2018
  • (2016) Fault Tolerance Techniques for Distributed, Parallel Applications. Innovative Research and Applications in Next-Generation High Performance Computing (221-252). DOI: 10.4018/978-1-5225-0287-6.ch009. Online publication date: 2016
  • (2015) Local recovery and failure masking for stencil-based applications at extreme scales. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-12). DOI: 10.1145/2807591.2807672. Online publication date: 15-Nov-2015
  • (2014) Exploring automatic, online failure recovery for scientific applications at extreme scales. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (895-906). DOI: 10.1109/SC.2014.78. Online publication date: 16-Nov-2014
  • (2014) Contention management in federated virtualized distributed systems. Software—Practice & Experience 44:3 (353-368). DOI: 10.1002/spe.2221. Online publication date: 1-Mar-2014
  • (2013) An Application-Level Synchronous Checkpoint-Recover Method for Parallel CFD Simulation. Proceedings of the 2013 IEEE 16th International Conference on Computational Science and Engineering (58-65). DOI: 10.1109/CSE.2013.19. Online publication date: 3-Dec-2013
