DOI: 10.1145/509593.509626

CLIP: a checkpointing tool for message-passing parallel programs

Published: 15 November 1997

ABSTRACT

Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost.

Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP.

We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.

        Published in

        SC '97: Proceedings of the 1997 ACM/IEEE conference on Supercomputing
        November 1997
        921 pages
        ISBN: 0897919858
        DOI: 10.1145/509593

        Copyright © 1997 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
