Article

Application-level checkpointing for shared memory programs

Authors:
Greg Bronevetsky, Cornell University, Ithaca, NY
Daniel Marques, Cornell University, Ithaca, NY
Keshav Pingali, Cornell University, Ithaca, NY
Peter Szwed, Cornell University, Ithaca, NY
Martin Schulz, University of California, Livermore, CA

ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
October 2004
Pages 235–247
https://doi.org/10.1145/1024393.1024421

Published: 07 October 2004

ABSTRACT

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
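To make the checkpoint-and-restart idea concrete, the following is a minimal single-threaded sketch in C of what hand-written application-level CPR might look like. It is not the paper's pre-compiler or runtime system, and it omits the protocol that coordinates threads. The names `State`, `save_checkpoint`, `load_checkpoint`, and `run`, as well as the checkpoint interval and file layout, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical application state: everything needed to resume the loop. */
typedef struct {
    long next_iter;  /* first iteration that has not yet been executed */
    double sum;      /* running result of the computation */
} State;

/* Save the state to disk. The data is written to a temporary file and then
 * renamed over the previous checkpoint, so that a crash in the middle of a
 * write does not destroy the last good checkpoint. (Assumption: rename()
 * replaces an existing file, as it does on POSIX systems.) */
int save_checkpoint(const char *path, const State *s) {
    char tmp[512];
    assert(path != NULL && s != NULL);
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (!f) return -1;
    int ok = fwrite(s, sizeof *s, 1, f) == 1;
    ok = (fclose(f) == 0) && ok;
    if (!ok) { remove(tmp); return -1; }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Load the last saved state; returns -1 if no usable checkpoint exists. */
int load_checkpoint(const char *path, State *s) {
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int ok = fread(s, sizeof *s, 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Sum the integers 0 .. stop_at-1, checkpointing every `interval`
 * iterations. On entry the function restarts from the last checkpoint if
 * one exists, otherwise from iteration 0. Calling it with a small stop_at
 * simulates a run that is cut short by a failure. */
State run(const char *path, long stop_at, long interval) {
    State s;
    if (load_checkpoint(path, &s) != 0) {
        s.next_iter = 0;
        s.sum = 0.0;
    }
    while (s.next_iter < stop_at) {
        s.sum += (double)s.next_iter;
        s.next_iter++;
        if (s.next_iter % interval == 0)
            save_checkpoint(path, &s);  /* errors ignored in this sketch */
    }
    return s;
}
```

In this sketch, a first call such as `run("job.ckpt", 25, 10)` stands in for a run that fails after 25 iterations; a later call `run("job.ckpt", 100, 10)` then resumes from the checkpoint taken at iteration 20 rather than from the beginning. The work done between the last checkpoint and the failure is simply repeated.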

Index Terms

1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Published in

ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
October 2004, 296 pages
ISBN: 1581138040
DOI: 10.1145/1024393
General Chair: Shubu Mukherjee, Intel Corporation
Program Chair: Kathryn S. McKinley, University of Texas at Austin

ACM SIGOPS Operating Systems Review, Volume 38, Issue 5 (ASPLOS '04)
December 2004, 283 pages
ISSN: 0163-5980
DOI: 10.1145/1037949

ACM SIGPLAN Notices, Volume 39, Issue 11 (ASPLOS '04)
November 2004, 283 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1037187

ACM SIGARCH Computer Architecture News, Volume 32, Issue 5 (ASPLOS 2004)
December 2004, 283 pages
ISSN: 0163-5964
DOI: 10.1145/1037947

Copyright © 2004 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
checkpointing
fault-tolerance
OpenMP
shared-memory programs
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 535 of 2,713 submissions, 20%