DOI: 10.1145/1024393.1024421
Article

Application-level checkpointing for shared memory programs

Published: 07 October 2004

ABSTRACT

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR.

Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs.

In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks.

One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
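The basic checkpoint-and-restart idea described in the abstract (save state periodically on disk, restart from the last saved state after a failure) can be illustrated with a minimal single-threaded sketch. This is not the paper's pre-compiler or its thread-coordination protocol: the function names, the `fail_at` parameter used to simulate a fault, the pickle-based file format, and the toy summation workload are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to disk, then rename, so a crash mid-write
    does not corrupt the previous checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical) simulates a hardware fault by raising
    an exception when the computation reaches that step.
    """
    # Restart from the last checkpoint if there is one, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.bin")
    try:
        run(ckpt, total_steps=100, interval=10, fail_at=57)
    except RuntimeError:
        pass  # work done after the step-50 checkpoint is lost
    # The second run resumes from step 50 rather than from step 0.
    print(run(ckpt, total_steps=100, interval=10))
```

In this sketch the application itself decides what to save (a small dictionary) and when, which is the sense in which checkpointing is "application-level". A multi-threaded program would additionally need all threads to agree on a consistent point at which to save, which is the job the abstract assigns to the runtime protocol.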

      • Published in

        ASPLOS XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
        October 2004
        296 pages
        ISBN: 1581138040
        DOI: 10.1145/1024393
        • ACM SIGOPS Operating Systems Review, Volume 38, Issue 5
          ASPLOS '04
          December 2004
          283 pages
          ISSN: 0163-5980
          DOI: 10.1145/1037949
        • ACM SIGPLAN Notices, Volume 39, Issue 11
          ASPLOS '04
          November 2004
          283 pages
          ISSN: 0362-1340
          EISSN: 1558-1160
          DOI: 10.1145/1037187
        • ACM SIGARCH Computer Architecture News, Volume 32, Issue 5
          ASPLOS 2004
          December 2004
          283 pages
          ISSN: 0163-5964
          DOI: 10.1145/1037947

        Copyright © 2004 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 535 of 2,713 submissions, 20%
