DOI: 10.1145/1066650.1066663
Article

Compiler-generated staggered checkpointing

Published: 22 October 2004

Abstract

To minimize work lost due to system failures, large parallel applications perform periodic checkpoints. These checkpoints are typically inserted manually by application programmers, resulting in synchronous checkpoints, that is, checkpoints that occur at the same program point in all processes. While this solution is tenable for current systems, it will become problematic for future supercomputers that have many tens of thousands of nodes, because contention for both the network and the file system will grow. This paper shows that staggered checkpoints (globally consistent checkpoints in which processes perform checkpoints at different points in the code) can significantly reduce network and file system contention. We describe a compiler-based approach for inserting staggered checkpoints, and we show, using trace-driven simulation, that staggered checkpointing is 23 times faster than synchronous checkpointing.
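
The contention argument in the abstract can be illustrated with a toy model. The sketch below is not the paper's trace-driven simulator and does not reproduce its results. It assumes a single shared file system whose bandwidth is split equally among all processes writing at the same moment. The function name `write_times` and all parameter values are hypothetical and chosen only for illustration.

```python
def write_times(start_times, size, bandwidth):
    """Return how long each process spends writing its checkpoint.

    Toy model (an assumption, not taken from the paper): every process
    writes `size` bytes, and all concurrent writers share `bandwidth`
    (bytes per second) equally.
    """
    # Process ids ordered by the time their checkpoint begins.
    pending = sorted(range(len(start_times)), key=lambda i: start_times[i])
    remaining = {}                      # process id -> bytes still to write
    durations = [0.0] * len(start_times)
    t = 0.0
    eps = 1e-9

    while pending or remaining:
        if not remaining:
            # File system is idle: jump ahead to the next checkpoint start.
            t = max(t, start_times[pending[0]])
        # Admit every process whose checkpoint has started by time t.
        while pending and start_times[pending[0]] <= t + eps:
            remaining[pending.pop(0)] = float(size)

        rate = bandwidth / len(remaining)            # fair share per writer
        to_finish = min(remaining.values()) / rate   # next completion
        to_arrival = start_times[pending[0]] - t if pending else float("inf")
        dt = min(to_finish, to_arrival)
        t += dt

        # Advance every active writer by dt and retire the finished ones.
        for pid in list(remaining):
            remaining[pid] -= rate * dt
            if remaining[pid] <= eps:
                durations[pid] = t - start_times[pid]
                del remaining[pid]
    return durations


# Hypothetical parameters: 4 processes, 100 bytes each, 100 bytes/second.
n, size, bandwidth = 4, 100.0, 100.0

# Synchronous: every process starts writing at the same time.
sync = write_times([0.0] * n, size, bandwidth)

# Staggered: each process starts once the previous write could have finished.
staggered = write_times([i * size / bandwidth for i in range(n)], size, bandwidth)
```

With these values each synchronous writer spends 4.0 seconds in its checkpoint, while each staggered writer spends 1.0 second. The last write finishes at the same time in both cases, so under this assumption the gain is in per-process stall time, not in the total data moved. The model ignores network effects and does not address how the staggered checkpoints are kept globally consistent.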

Published In

LCR '04: Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems
October 2004
134 pages
ISBN: 9781450377997
DOI: 10.1145/1066650
Sponsors

  • The Texas Learning & Computation Center
  • University of Houston

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

LCR04
Cited By

  • (2011) Staggered Checkpointing and Recovery in Cluster Based Mobile Ad Hoc Networks. Advances in Parallel Distributed Computing, pp. 122-134. DOI: 10.1007/978-3-642-24037-9_13. Online publication date: 2011
  • (2010) Using Redundant Threads for Fault Tolerance of OpenMP Programs. 2010 International Conference on Information Science and Applications, pp. 1-8. DOI: 10.1109/ICISA.2010.5480321. Online publication date: Apr-2010
  • (2010) TH-1. Frontiers of Computer Science in China 4:4, pp. 445-455. DOI: 10.1007/s11704-010-0383-x. Online publication date: 1-Dec-2010
  • (2009) FTPA. IEEE Transactions on Parallel and Distributed Systems 20:10, pp. 1471-1486. DOI: 10.1109/TPDS.2008.231. Online publication date: 1-Oct-2009
  • (2007) Choosing Method of the Most Effective Nested Loop Shearing for Parallelism. Proceedings of the Eighth International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 267-276. DOI: 10.1109/PDCAT.2007.26. Online publication date: 3-Dec-2007
  • (2006) Cooperative checkpointing. Proceedings of the 20th annual international conference on Supercomputing, pp. 14-23. DOI: 10.1145/1183401.1183406. Online publication date: 28-Jun-2006
