Efficient algorithms for crash recovery in distributed systems

  • Distributed Computing
  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 472)

Abstract

We consider the problem of recovering from processor failures efficiently in distributed systems. Each message received is logged in volatile storage when it is processed. At irregular intervals, each processor independently saves the contents of its volatile storage in stable storage. By appending only O(1) extra information to each message, we show that for recovery in general networks O(n²) messages are sufficient and in ring networks Θ(n) messages are necessary and sufficient when an arbitrary number of processors fail. By appending O(n) extra information to each message that is sent, we show that O(kn) messages are sufficient for rolling back all of the processors to the maximum consistent states when there are k failures.
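
To make "rolling back to the maximum consistent states" concrete, the following is a minimal centralized sketch, not the paper's distributed algorithm. It assumes each recoverable state records per-peer counters of messages sent and received, and it treats a set of states as consistent when no processor has received more messages from a peer than that peer has sent. The names `State`, `max_consistent_cut`, and `histories` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    """One recoverable state of a processor (assumed representation)."""
    sent: dict = field(default_factory=dict)  # peer -> messages sent to peer so far
    rcvd: dict = field(default_factory=dict)  # peer -> messages received from peer so far


def max_consistent_cut(histories):
    """Return, per processor, the index of the latest state in a consistent cut.

    histories[p] lists the states of p that survive the failure, oldest first.
    Index 0 is assumed to be the initial state with all counters at zero,
    which guarantees the rollback loop terminates.
    """
    cut = {p: len(h) - 1 for p, h in histories.items()}
    changed = True
    while changed:
        changed = False
        for p, h in histories.items():
            for q in histories:
                if q == p:
                    continue
                sent_by_q = histories[q][cut[q]].sent.get(p, 0)
                # Roll p back while it remembers messages that q no longer sent.
                while h[cut[p]].rcvd.get(q, 0) > sent_by_q:
                    cut[p] -= 1
                    changed = True
    return cut


# Processor "b" failed and lost the state in which it sent a message to "a",
# so its surviving history ends at its last saved state.
histories = {
    "a": [
        State(),
        State(sent={"b": 1}),
        State(sent={"b": 1}, rcvd={"b": 1}),
        State(sent={"b": 1, "c": 1}, rcvd={"b": 1}),
    ],
    "b": [State(), State(rcvd={"a": 1})],
    "c": [State(), State(rcvd={"a": 1})],
}

cut = max_consistent_cut(histories)
print(cut)
```

In this example, "a" must discard the states that depend on the lost message from "b", and that in turn forces "c" to discard the message it received from "a". The sketch recomputes the cut with global knowledge; the message bounds stated in the abstract concern how cheaply the processors can reach such a cut by exchanging messages.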

Editor information

Kesav V. Nori, C. E. Veni Madhavan

Copyright information

© 1990 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Juang, T.TY., Venkatesan, S. (1990). Efficient algorithms for crash recovery in distributed systems. In: Nori, K.V., Veni Madhavan, C.E. (eds) Foundations of Software Technology and Theoretical Computer Science. FSTTCS 1990. Lecture Notes in Computer Science, vol 472. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-53487-3_56

  • DOI: https://doi.org/10.1007/3-540-53487-3_56

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-53487-7

  • Online ISBN: 978-3-540-46313-9

  • eBook Packages: Springer Book Archive
