Efficient algorithms for optimistic crash recovery

Distributed Computing

Summary

Recovery from transient processor failures can be achieved by using optimistic message logging and checkpointing. The faulty processors roll back, and some or all of the non-faulty processors may also have to roll back. This paper formulates the rollback problem as a closure problem. A centralized closure algorithm is presented, together with two efficient distributed implementations. Several related problems are also considered, and distributed algorithms are presented for solving them.
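The summary does not spell out the closure algorithm, so the following is only a minimal sketch of how a rollback computation can be phrased as a fixed-point (closure) iteration, not the authors' method. The data model is an assumption made for illustration: `sent[p][k][q]` and `recv[p][k][q]` are hypothetical cumulative message counts for processor `p` at its `k`-th recoverable state, state 0 is assumed to have all counts at zero, and `rollback_closure` is a hypothetical name.

```python
from typing import Dict, List

# Hypothetical data model (not taken from the paper): for each processor p,
# sent[p][k][q] is how many messages p had sent to q by its k-th recoverable
# state, and recv[p][k][q] is how many messages p had received from q by then.
# Counts are assumed to be non-decreasing in k, with state 0 all zeros.
Counts = Dict[str, List[Dict[str, int]]]


def rollback_closure(sent: Counts, recv: Counts, start: Dict[str, int]) -> Dict[str, int]:
    """Return, per processor, the latest state index at or below `start`
    such that no processor has received a message its sender has not sent."""
    line = dict(start)
    changed = True
    while changed:  # repeat until a full pass causes no further rollback
        changed = False
        for p in line:
            for q in line:
                if p == q:
                    continue
                # Messages q has sent to p in q's current candidate state.
                available = sent[q][line[q]].get(p, 0)
                # Roll p back while it has received more from q than q has sent.
                while line[p] > 0 and recv[p][line[p]].get(q, 0) > available:
                    line[p] -= 1
                    changed = True
    return line


# Example: A loses its state 2 in a crash and restarts from state 1.
sent = {
    "A": [{"B": 0}, {"B": 1}, {"B": 2}],
    "B": [{"A": 0}, {"A": 0}, {"A": 1}],
}
recv = {
    "A": [{"B": 0}, {"B": 0}, {"B": 1}],
    "B": [{"A": 0}, {"A": 1}, {"A": 2}],
}
recovery_line = rollback_closure(sent, recv, {"A": 1, "B": 2})
```

In this example B's state 2 depends on a message that A no longer remembers sending, so the non-faulty processor B must also roll back. Because state indices only decrease and are bounded below by 0, the iteration terminates.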

Author information

Corresponding author

Correspondence to S. Venkatesan.

Additional information

S. Venkatesan received the B.Tech. and M.Tech. degrees from the Indian Institute of Technology, Madras in 1981 and 1983, respectively, and the M.S. and Ph.D. degrees in Computer Science from the University of Pittsburgh in 1985 and 1988, respectively. He joined the University of Texas at Dallas in January 1989, where he is currently an Assistant Professor of Computer Science. His research interests are in fault-tolerant distributed systems, distributed algorithms, testing and debugging distributed programs, fault-tolerant telecommunication networks, and mobile computing.

Tony Tong-Ying Juang is an Associate Professor of Computer Science at the Chung-Hwa Polytechnic Institute. He received the B.S. degree in Naval Architecture from the National Taiwan University in 1983 and the M.S. and Ph.D. degrees in Computer Science from the University of Texas at Dallas in 1989 and 1992, respectively. His research interests include distributed algorithms, fault-tolerant distributed computing, distributed operating systems, and computer communications.

This research was supported in part by NSF under Grant No. CCR-9110177 and by the Texas Advanced Technology Program under Grant No. 9741-036.

About this article

Cite this article

Venkatesan, S., Juang, T.T.Y. Efficient algorithms for optimistic crash recovery. Distrib Comput 8, 105–114 (1994). https://doi.org/10.1007/BF02280832

  • DOI: https://doi.org/10.1007/BF02280832
