ABSTRACT
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
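The basic CPR cycle described above (save state periodically, restart from the last saved state after a failure) can be illustrated with a minimal single-threaded sketch. This is not the paper's pre-compiler or its thread-coordination protocol; the function names `save_checkpoint`, `load_checkpoint`, and `run`, the `fail_at` parameter, and the use of Python with `pickle` are all assumptions made for illustration only.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Save state to disk: write a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename replaces the previous checkpoint in one step, so a crash
    # during the write above leaves the old checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical, for demonstration) simulates a hardware fault
    by raising an exception just before that step executes.
    """
    # Restart from the last saved state if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same checkpoint path after a simulated fault resumes from the most recent checkpoint rather than from step 0; only the work done since that checkpoint is repeated.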
Application-level checkpointing for shared memory programs
ASPLOS '04