Efficient algorithms for crash recovery in distributed systems

Juang, Tony T-Y.; Venkatesan, S.

doi:10.1007/3-540-53487-3_56

Distributed Computing
Conference paper
First Online: 01 January 2005

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 472)

Abstract

We consider the problem of recovering efficiently from processor failures in distributed systems. Each message received is logged in volatile storage when it is processed. At irregular intervals, each processor independently saves the contents of its volatile storage in stable storage. By appending only O(1) extra information to each message, we show that O(n²) messages are sufficient for recovery in general networks, and that Θ(n) messages are necessary and sufficient in ring networks, when an arbitrary number of processors fail. By appending O(n) extra information to each message that is sent, we show that O(kn) messages are sufficient for rolling back all of the processors to the maximum consistent states when there are k failures.
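
The abstract does not spell out the recovery algorithms themselves, so the following is only a toy illustration of the setting it describes: events logged in volatile storage, occasional saves to stable storage, and a rollback to a maximum consistent state after a crash. It is a centralized sketch, not the distributed algorithm of the paper, and it makes no attempt to match the stated message bounds. The names `Processor`, `max_consistent_cut`, and `demo` are hypothetical, and the sketch assumes FIFO channels and that both sends and receives are logged, so that per-channel counts are enough to detect a message that was received but whose send has been lost.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Toy processor (hypothetical): logs events in volatile storage."""
    pid: int
    volatile: list = field(default_factory=list)  # lost on a crash
    stable: list = field(default_factory=list)    # survives a crash

    def log(self, kind, peer):
        # kind is "send" or "recv"; peer is the other processor's id.
        self.volatile.append((kind, peer))

    def checkpoint(self):
        # Save the current contents of volatile storage in stable storage.
        self.stable = list(self.volatile)

    def crash(self):
        # Everything logged after the last checkpoint is lost.
        self.volatile = list(self.stable)


def count(events, kind, peer):
    """Number of events of the given kind involving the given peer."""
    return sum(1 for k, p in events if k == kind and p == peer)


def max_consistent_cut(procs):
    """Return {pid: number of surviving events to keep}.

    A cut is treated as consistent when, on every channel i -> j, processor j
    has received no more messages than processor i has sent (assumption: FIFO
    channels). Processors are rolled back one event at a time until this
    holds everywhere.
    """
    cut = {p.pid: len(p.volatile) for p in procs}
    changed = True
    while changed:
        changed = False
        for receiver in procs:
            for sender in procs:
                if sender.pid == receiver.pid:
                    continue
                sent = count(sender.volatile[:cut[sender.pid]],
                             "send", receiver.pid)
                # Undo the receiver's events until no orphan receive remains.
                while count(receiver.volatile[:cut[receiver.pid]],
                            "recv", sender.pid) > sent:
                    cut[receiver.pid] -= 1
                    changed = True
    return cut


def demo():
    p0, p1, p2 = Processor(0), Processor(1), Processor(2)
    p0.log("send", 1)   # m1: P0 -> P1
    p0.checkpoint()     # P0 saves its log to stable storage
    p1.log("recv", 0)   # P1 receives m1
    p0.log("send", 1)   # m2: P0 -> P1 (never checkpointed)
    p1.log("recv", 0)   # P1 receives m2
    p1.log("send", 2)   # m3: P1 -> P2, sent after receiving m2
    p2.log("recv", 1)   # P2 receives m3
    p0.crash()          # P0 loses the record of sending m2
    return max_consistent_cut([p0, p1, p2])


if __name__ == "__main__":
    print(demo())
```

In this scenario the crash of P0 forces P1 to undo its receipt of m2 and everything after it, which in turn forces P2 to undo its receipt of m3. This cascading dependency is what a recovery algorithm has to resolve.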

Author information

Authors and Affiliations

Computer Science Program, MP 31, University of Texas at Dallas, 75083-0688, Richardson, TX
Tony T-Y. Juang & S. Venkatesan

Editor information

Kesav V. Nori, C. E. Veni Madhavan

About this paper

Cite this paper

Juang, T.TY., Venkatesan, S. (1990). Efficient algorithms for crash recovery in distributed systems. In: Nori, K.V., Veni Madhavan, C.E. (eds) Foundations of Software Technology and Theoretical Computer Science. FSTTCS 1990. Lecture Notes in Computer Science, vol 472. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-53487-3_56

DOI: https://doi.org/10.1007/3-540-53487-3_56
Published: 01 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-53487-7
Online ISBN: 978-3-540-46313-9
eBook Packages: Springer Book Archive
