Summary
Recovery from transient processor failures can be achieved by using optimistic message logging and checkpointing. The faulty processors roll back, and some or all of the non-faulty processors may also have to roll back. This paper formulates the rollback problem as a closure problem. A centralized closure algorithm is presented together with two efficient distributed implementations. Several related problems are also considered, and distributed algorithms are presented for solving them.
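To make the idea of rollback as a closure problem concrete, the following is a minimal centralized sketch. It is not the algorithm from the paper, whose details are not given in this summary; it is a generic fixed-point computation under stated assumptions. Each processor's state is assumed to be identified by a count of processed events, and a message is assumed to be an orphan when its receipt survives the rollback but its send does not. The names `Message`, `rollback_closure`, `latest`, and `restored` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: str
    send_idx: int   # events the sender had processed when it sent the message
    receiver: str
    recv_idx: int   # events the receiver had processed right after delivery (>= 1)


def rollback_closure(
    latest: Dict[str, int],
    restored: Dict[str, int],
    messages: List[Message],
) -> Dict[str, int]:
    """Return, per processor, the latest event count it can keep.

    latest:   event count of every processor before the failure.
    restored: event count each faulty processor can recover from stable storage.
    """
    line = dict(latest)
    for proc, idx in restored.items():
        line[proc] = min(line[proc], idx)

    # Repeat until no orphan message remains (the closure / fixed point).
    changed = True
    while changed:
        changed = False
        for m in messages:
            send_survives = m.send_idx <= line[m.sender]
            recv_survives = m.recv_idx <= line[m.receiver]
            if recv_survives and not send_survives:
                # Orphan: the receiver must undo the delivery of this message.
                line[m.receiver] = m.recv_idx - 1
                changed = True
    return line


if __name__ == "__main__":
    msgs = [
        Message("P", 3, "Q", 2),  # sent after P's recoverable state
        Message("Q", 3, "R", 3),  # becomes an orphan once Q rolls back
        Message("R", 1, "P", 2),  # unaffected
    ]
    print(rollback_closure({"P": 5, "Q": 4, "R": 3}, {"P": 2}, msgs))
    # {'P': 2, 'Q': 1, 'R': 2}
```

In this example only P fails, yet Q and R must also roll back because each has received a message whose send was undone. The loop terminates because every change strictly lowers some processor's event count, which cannot fall below zero.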
Additional information
S. Venkatesan received the B. Tech. and M. Tech. degrees from the Indian Institute of Technology, Madras in 1981 and 1983, respectively, and the M.S. and Ph.D. degrees in Computer Science from the University of Pittsburgh in 1985 and 1988. He joined the University of Texas at Dallas in January 1989, where he is currently an Assistant Professor of Computer Science. His research interests are in fault-tolerant distributed systems, distributed algorithms, testing and debugging distributed programs, fault-tolerant telecommunication networks, and mobile computing.
Tony Tony-Ying Juang is an Associate Professor of Computer Science at the Chung-Hwa Polytechnic Institute. He received the B.S. degree in Naval Architecture from the National Taiwan University in 1983 and his M.S. and Ph.D. degrees in Computer Science from the University of Texas at Dallas in 1989 and 1992, respectively. His research interests include distributed algorithms, fault-tolerant distributed computing, distributed operating systems and computer communications.
This research was supported in part by NSF under Grant No. CCR-9110177 and by the Texas Advanced Technology Program under Grant No. 9741-036.
Cite this article
Venkatesan, S., Juang, T.T.Y. Efficient algorithms for optimistic crash recovery. Distrib Comput 8, 105–114 (1994). https://doi.org/10.1007/BF02280832
DOI: https://doi.org/10.1007/BF02280832