Checkpointing

  • Reference work entry
Encyclopedia of Parallel Computing

Bibliography

  1. Vadhiyar S, Dongarra J (2003) SRS – a framework for developing malleable and migratable parallel software. Parallel Process Lett 13(2):291–312

  2. Beck M, Plank JS, Kingsley G (1994) Compiler-assisted checkpointing. In: Technical report CS-94-269, Department of Computer Science, University of Tennessee, Knoxville, December 1994

  3. Li CCJ, Stewart EM, Fuchs WK (1994) Compiler-assisted full checkpointing. Softw Pract Exper 24(10):871–886

  4. University of Mannheim, University of Tennessee, and NERSC/LBNL. TOP500 Supercomputing Sites. http://www.top500.org/

  5. Lawrence Livermore National Laboratory. NNSA awards IBM contract to build next generation supercomputer, press release. https://publicaffairs.llnl.gov/news/newsreleases/2009/NR-09-02-01.html. Accessed Feb 2009

  6. Bronevetsky G, Pingali K, Stodghill P (2006) Experimental evaluation of application-level checkpointing for OpenMP programs. In: International conference on supercomputing (ICS), Queensland, June 2006

  7. Chandy M, Lamport L (1985) Distributed snapshots: determining global states of distributed systems. ACM Transact Comput Syst 3(1):63–75

  8. Schulz M, Bronevetsky G, Fernandes R, Marques D, Pingali K, Stodghill P (2004) Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In: Proceedings of IEEE/ACM supercomputing ’04, Washington, DC, November 2004

  9. Silva LM, Silva JG (1998) An experimental study about diskless checkpointing. EUROMICRO Conf 1:10395

  10. Plank JS, Li K, Puening MA (1998) Diskless checkpointing. IEEE Trans Parallel Distrib Syst 9(10):972–986

  11. Zheng G, Shi L, Kale LV (2004) FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In: 2004 IEEE international conference on cluster computing, pp 93–103, San Diego, September 2004

  12. Moody A, Bronevetsky G, Mohror K, de Supinski BR (2010) Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of IEEE/ACM supercomputing ’10, New Orleans, LA, 2010

  13. Agarwal S, Garg R, Gupta MS, Moreira JE (2004) Adaptive incremental checkpointing for massively parallel systems. In: ICS ’04: proceedings of the 18th annual international conference on supercomputing. ACM, New York, pp 277–286

  14. Sancho JC, Petrini F, Johnson G, Fernández J, Frachtenberg E (2004) On the feasibility of incremental checkpointing for scientific computing. Parallel Distrib Process Symp Int 1:58b

  15. Litzkow M, Tannenbaum T, Basney J, Livny M (1997) Checkpoint and migration of UNIX processes in the Condor distributed processing system. In: Technical report 1346, University of Wisconsin, Madison, 1997

  16. Condor. http://www.cs.wisc.edu/condor/manual

  17. CHARM research group. http://charm.cs.uiuc.edu/

  18. Kale LV, Krishnan S (1993) CHARM++: a portable concurrent object oriented system based on C++. SIGPLAN Not 28(10):91–108

  19. Elnozahy M, Alvisi L, Wang YM, Johnson DB (1996) A survey of rollback-recovery protocols in message passing systems. In: Technical report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, October 1996

  20. Librato. Availability Services (AvS). http://www.librato.com/products/availability.services

  21. Plank JS, Beck M, Kingsley G, Li K (1994) Libckpt: transparent checkpointing under UNIX. In: Technical report UT-CS-94-242, Department of Computer Science, University of Tennessee, Princeton University

  22. Duell J. The design and implementation of Berkeley Lab’s Linux checkpoint/restart. http://www.nersc.gov/research/FTG/checkpoint/reports.html

  23. Stellner G (1996) CoCheck: checkpointing and process migration for MPI. In: Proceedings of the 10th international parallel processing symposium (IPPS ’96), Honolulu, 1996

  24. Bouteiller A, Cappello F, Herault T, Krawezik G, Lemarinier P, Magniette F (2003) MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In: Proceedings of IEEE/ACM supercomputing ’03, Phoenix, November 2003

  25. Wang YM, Fuchs WK (1992) Optimistic message logging for independent checkpointing in message-passing systems. In: Proceedings of the 11th symposium on reliable distributed systems, Houston, October 1992, pp 147–154

Copyright information

© 2011 Springer Science+Business Media, LLC

About this entry

Cite this entry

Schulz, M. (2011). Checkpointing. In: Padua, D. (ed.) Encyclopedia of Parallel Computing. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09766-4_62
