
Optimization of checkpointing-related I/O for high-performance parallel and distributed computing

The Journal of Supercomputing

Abstract

Checkpointing, the process of saving program or application state, usually to stable storage, has been the most common fault-tolerance methodology for high-performance applications. How often checkpoints are taken is driven primarily by the failure rate of the system. A low checkpointing rate consumes fewer resources but increases the chance of a large computational loss when a failure occurs; a high rate limits that loss but consumes more resources. An optimum checkpointing rate is therefore needed to balance the two. In this paper, we analytically model the checkpointing process in terms of the mean time between failures (MTBF) of the system, the amount of memory being checkpointed, the sustainable I/O bandwidth to stable storage, and the frequency of checkpointing. We identify the optimum checkpointing frequency for systems with given specifications, thereby enabling efficient use of available resources and maximum system performance without compromising fault tolerance. Further, we develop discrete-event models that simulate the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum checkpointing rate for systems of varying resource levels, ranging from small embedded cluster systems to large supercomputers.
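The trade-off described in the abstract can be illustrated with a small sketch. The paper's own analytical model is not reproduced on this page, so the code below substitutes the classical first-order approximation usually attributed to Young, in which the compute time between checkpoints is sqrt(2 × checkpoint cost × MTBF). The function names, the assumption that checkpoint cost equals memory divided by sustained bandwidth, and the sample figures are illustrative assumptions, not values taken from the paper.

```python
import math


def checkpoint_cost_s(memory_bytes, bandwidth_bytes_per_s):
    """Time to write one checkpoint, assuming cost = memory / sustained bandwidth."""
    return memory_bytes / bandwidth_bytes_per_s


def young_interval_s(cost_s, mtbf_s):
    """First-order optimum compute time between checkpoints (Young's approximation).

    This is a stand-in for the paper's model, not the model itself.
    """
    return math.sqrt(2.0 * cost_s * mtbf_s)


def overhead_fraction(interval_s, cost_s, mtbf_s):
    """Approximate fraction of time lost for a given checkpoint interval.

    cost_s / interval_s     : time spent writing checkpoints
    interval_s / (2 * mtbf) : expected rework, assuming a failure loses
                              half an interval of computation on average
    """
    return cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical system: 1 TiB of state, 1 GiB/s to stable storage, 24 h MTBF.
    cost = checkpoint_cost_s(2**40, 2**30)
    mtbf = 24 * 3600.0
    best = young_interval_s(cost, mtbf)
    for label, interval in (("half", 0.5 * best), ("optimum", best), ("double", 2.0 * best)):
        lost = overhead_fraction(interval, cost, mtbf)
        print(f"{label:>8}: interval={interval:9.1f} s, overhead={lost:.3f}")
```

Checkpointing either more or less often than the approximate optimum increases the estimated overhead, which is the balance the abstract refers to. The approximation is only reasonable when the checkpoint cost is small relative to the MTBF.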



Author information

Corresponding author

Correspondence to Rajagopal Subramaniyan.

Additional information

An earlier version of this paper appeared in Proceedings of the 2006 International Conference on Parallel and Distributed Processing Techniques and Applications, June 2006.

About this article

Cite this article

Subramaniyan, R., Grobelny, E., Studham, S. et al. Optimization of checkpointing-related I/O for high-performance parallel and distributed computing. J Supercomput 46, 150–180 (2008). https://doi.org/10.1007/s11227-007-0162-0

  • DOI: https://doi.org/10.1007/s11227-007-0162-0
