SFT: A consistent checkpointing algorithm with short freezing time

Journal of Computer Science and Technology

Abstract

This paper presents SFT, a consistent checkpointing algorithm with short freezing time that supports fault tolerance in distributed systems. The algorithm offers a shorter freezing time, lower overhead, and simpler recovery. To shorten checkpoint time, a special control message (Munblock) ensures that a process can respond to a checkpoint event quickly at any given time. In addition, a main-memory algorithm is used to improve the concurrency of checkpointing. With SFT, the freezing time caused by checkpointing is less than 0.03 s, and the number of control messages is only O(n).
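The abstract does not spell out the main-memory algorithm, so the following is only a minimal sketch of the general idea it names: the process is frozen just long enough to copy its state into main memory, and the slow write to stable storage then runs concurrently with the application. This is not the SFT protocol itself; the names `checkpoint` and `restore`, the dictionary-based state, and the use of a background thread are all assumptions made for illustration.

```python
import copy
import os
import pickle
import tempfile
import threading
import time


def checkpoint(state, path):
    """Snapshot `state` in memory, then write the snapshot to `path` in the background.

    Hypothetical helper, not taken from the paper. Returns (freeze_seconds, writer),
    where freeze_seconds covers only the in-memory copy that blocks the caller.
    """
    start = time.perf_counter()
    snapshot = copy.deepcopy(state)  # the caller is "frozen" only for this copy
    freeze_seconds = time.perf_counter() - start

    def write_snapshot():
        # The slow stable-storage write overlaps with normal execution.
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    writer = threading.Thread(target=write_snapshot)
    writer.start()
    return freeze_seconds, writer


def restore(path):
    """Load a previously written snapshot (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    ckpt_path = os.path.join(tempfile.mkdtemp(), "process.ckpt")
    state = {"step": 1, "buffer": [1, 2, 3]}

    freeze, writer = checkpoint(state, ckpt_path)
    state["step"] = 2            # the application keeps running and mutating state
    state["buffer"].append(4)

    writer.join()                # wait only so the demo can read the file back
    print("frozen for", freeze, "seconds")
    print("restored:", restore(ckpt_path))
```

Because the snapshot is a deep copy, later changes to `state` do not leak into the checkpoint file; the cost of that isolation is the extra memory held until the background write completes.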

Additional information

This work is supported by the National Natural Science Foundation of China (Grant No. 69673012).

WEI Xiaohui was born in May 1972. He is a Ph.D. candidate in the Computer Science Department, Jilin University. His main research interest is distributed systems.

JU Jiubin was born in August 1935. He is a Professor in the Computer Science Department, Jilin University. His main research interest is distributed systems.

Cite this article

Wei, X., Ju, J. SFT: A consistent checkpointing algorithm with short freezing time. J. Comput. Sci. & Technol. 15, 169–175 (2000). https://doi.org/10.1007/BF02948801

  • DOI: https://doi.org/10.1007/BF02948801
