Abstract
This paper presents SFT, a consistent checkpointing algorithm with short freezing time that supports fault tolerance in distributed systems. The algorithm offers a short freezing time, low overhead, and simple recovery. To shorten checkpoint time, a special control message (Munblock) is used to ensure that a process can respond to a checkpoint event quickly at any given time. In addition, a main-memory algorithm is used to improve the concurrency of checkpointing. With SFT, the freezing time caused by checkpointing is less than 0.03 s. Furthermore, the number of control messages in SFT is only O(n).
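The two ideas named in the abstract, copying state to main memory so that the freeze stays short and keeping control traffic linear in the number of processes, can be illustrated with a small sketch. This is not the SFT algorithm: the abstract does not specify the protocol, so the `Process` class, the `coordinated_round` function, and the request/acknowledge/commit message pattern are all hypothetical, and the role of Munblock is deliberately not modeled.

```python
import copy
import json
import os
import tempfile
import threading
import time


class Process:
    """Toy process whose state is a plain dict (hypothetical, for illustration)."""

    def __init__(self, pid, state):
        self.pid = pid
        self.state = state
        self.frozen_seconds = 0.0

    def checkpoint(self, directory):
        """Freeze only for the in-memory copy; write to disk in the background."""
        start = time.perf_counter()
        snapshot = copy.deepcopy(self.state)  # the process is frozen only here
        self.frozen_seconds = time.perf_counter() - start

        path = os.path.join(directory, f"ckpt_{self.pid}.json")
        writer = threading.Thread(target=_write_snapshot, args=(snapshot, path))
        writer.start()  # the process may resume while the write proceeds
        return writer, path


def _write_snapshot(snapshot, path):
    """Persist the in-memory copy; runs concurrently with the resumed process."""
    with open(path, "w") as f:
        json.dump(snapshot, f)


def coordinated_round(processes, directory):
    """Run one checkpoint round; return (checkpoint paths, control messages).

    Assumed message pattern (not taken from the paper): one request, one
    acknowledgement, and one commit per process, i.e. 3n messages in total.
    """
    messages = 0
    pending = []
    for p in processes:
        messages += 1  # coordinator -> process: checkpoint request
        pending.append(p.checkpoint(directory))
        messages += 1  # process -> coordinator: acknowledgement
    paths = []
    for writer, path in pending:
        writer.join()  # wait until the background write has finished
        paths.append(path)
    messages += len(processes)  # coordinator -> process: commit
    return paths, messages


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    procs = [Process(i, {"counter": i}) for i in range(4)]
    paths, messages = coordinated_round(procs, workdir)
    print(messages, [round(p.frozen_seconds, 6) for p in procs])
```

Because the disk write operates on the in-memory copy, a process can modify its live state as soon as the copy is taken without corrupting the checkpoint. Under the assumed pattern the message count grows as 3n, which is one way an O(n) bound can arise.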
Additional information
This work is supported by the National Natural Science Foundation of China (Grant No. 69673012).
WEI Xiaohui was born in May 1972. He is a Ph.D. candidate in the Department of Computer Science, Jilin University. His main research interest is distributed systems.
JU Jiubin was born in August 1935. He is a Professor in the Department of Computer Science, Jilin University. His main research interest is distributed systems.
Cite this article
Wei, X., Ju, J. SFT: A consistent checkpointing algorithm with short freezing time. J. Comput. Sci. & Technol. 15, 169–175 (2000). https://doi.org/10.1007/BF02948801