ABSTRACT
Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semi-transparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost.

Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP.

We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.
- CLIP: a checkpointing tool for message-passing parallel programs