Efficient algorithms for optimistic crash recovery

Venkatesan, S.; Juang, Tony T -Y.

doi:10.1007/BF02280832


Published: October 1994

Distributed Computing, volume 8, pages 105–114 (1994)


Summary

Recovery from transient processor failures can be achieved by using optimistic message logging and checkpointing. The faulty processors roll back, and some or all of the non-faulty processors may also have to roll back. This paper formulates the rollback problem as a closure problem. A centralized closure algorithm is presented, together with two efficient distributed implementations. Several related problems are also considered, and distributed algorithms are presented for solving them.
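To make the closure formulation concrete, here is a minimal centralized sketch. It is not the algorithm from the paper, whose details are not given in this summary. It is a generic fixpoint computation under stated assumptions: channels are FIFO, each process records per-peer counts of messages sent and received at every logged state, and state 0 is an initial state with no messages. The names `rollback_closure`, `history`, and `start` are hypothetical. A process is pushed back whenever it has received more messages from a peer than that peer has sent at its own rollback point, and this repeats until no such orphan messages remain.

```python
from typing import Dict, List, Tuple

Counts = Dict[str, int]
# history[p][k] = (sent, recv) at p's k-th logged state, where sent[q] and
# recv[q] count messages p has sent to / received from q so far.
# Assumption: history[p][0] is the initial state with empty counts.
History = Dict[str, List[Tuple[Counts, Counts]]]


def rollback_closure(history: History, start: Dict[str, int]) -> Dict[str, int]:
    """Return, for each process, the latest state index that is free of
    orphan messages, starting from the rollback points in `start`."""
    point = dict(start)
    changed = True
    while changed:
        changed = False
        for p in history:
            for q in history:
                if p == q:
                    continue
                sent_by_q = history[q][point[q]][0].get(p, 0)
                # p holds an orphan if it received more from q than q has sent.
                while history[p][point[p]][1].get(q, 0) > sent_by_q:
                    point[p] -= 1
                    changed = True
    return point


# Hypothetical run: A sent two messages to B, B then sent one to C.
history = {
    "A": [({}, {}), ({"B": 1}, {}), ({"B": 2}, {})],
    "B": [({}, {}), ({}, {"A": 1}), ({}, {"A": 2}), ({"C": 1}, {"A": 2})],
    "C": [({}, {}), ({}, {"B": 1})],
}
# A fails and restarts from state 1; B and C start from their latest states.
start = {"A": 1, "B": 3, "C": 1}
result = rollback_closure(history, start)
print(result)
```

In this run the failure of A forces B back to state 1, which in turn forces C back to state 0, illustrating how non-faulty processors can be drawn into the rollback.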


Author information

Authors and Affiliations

Computer Science Program, University of Texas at Dallas, 75083-0688, Richardson, TX, USA
S. Venkatesan & Tony T -Y. Juang


Corresponding author

Correspondence to S. Venkatesan.

Additional information

S. Venkatesan received the B. Tech. and M. Tech. degrees from the Indian Institute of Technology, Madras, in 1981 and 1983, respectively, and the M.S. and Ph.D. degrees in Computer Science from the University of Pittsburgh in 1985 and 1988. He joined the University of Texas at Dallas in January 1989, where he is currently an Assistant Professor of Computer Science. His research interests are in fault-tolerant distributed systems, distributed algorithms, testing and debugging distributed programs, fault-tolerant telecommunication networks, and mobile computing.

Tony Tony-Ying Juang is an Associate Professor of Computer Science at the Chung-Hwa Polytechnic Institute. He received the B.S. degree in Naval Architecture from the National Taiwan University in 1983 and his M.S. and Ph.D. degrees in Computer Science from the University of Texas at Dallas in 1989 and 1992, respectively. His research interests include distributed algorithms, fault-tolerant distributed computing, distributed operating systems and computer communications.

This research was supported in part by NSF under Grant No. CCR-9110177 and by the Texas Advanced Technology Program under Grant No. 9741-036.


About this article

Cite this article

Venkatesan, S., Juang, T.T.Y. Efficient algorithms for optimistic crash recovery. Distrib Comput 8, 105–114 (1994). https://doi.org/10.1007/BF02280832


Received: 15 November 1992
Accepted: 15 April 1994
Issue Date: October 1994
DOI: https://doi.org/10.1007/BF02280832
