Abstract
We consider the problem of recovering efficiently from processor failures in distributed systems. Each message received is logged in volatile storage when it is processed. At irregular intervals, each processor independently saves the contents of its volatile storage to stable storage. By appending only O(1) extra information to each message, we show that O(n²) messages are sufficient for recovery in general networks, and that Θ(n) messages are necessary and sufficient in ring networks, when an arbitrary number of processors fail. By appending O(n) extra information to each message sent, we show that O(kn) messages are sufficient for rolling back all of the processors to the maximum consistent states when there are k failures.
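To make the notion of "maximum consistent states" concrete, the following is a minimal, centralized sketch and not the distributed algorithm of the paper. It assumes reliable FIFO channels and models each processor's history as a list of send and receive events. A set of rolled-back states is treated as consistent when no processor has received more messages from a peer than that peer has sent to it. All names (`max_consistent_cut`, `count`, the event encoding) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical event encoding: ("send" | "recv", peer name).
Event = Tuple[str, str]


def count(log: List[Event], upto: int, kind: str, peer: str) -> int:
    """Number of events of the given kind involving `peer` in log[:upto]."""
    return sum(1 for k, p in log[:upto] if k == kind and p == peer)


def max_consistent_cut(logs: Dict[str, List[Event]],
                       start: Dict[str, int]) -> Dict[str, int]:
    """Shrink per-processor prefix lengths until no receive lacks its send.

    `start[p]` is the number of events processor p has retained: the last
    checkpoint for a failed processor, the full log for a live one.
    Assumption: FIFO channels, so comparing counts per channel suffices.
    """
    cut = dict(start)
    changed = True
    while changed:
        changed = False
        for j, log_j in logs.items():
            for i in logs:
                if i == j:
                    continue
                sent = count(logs[i], cut[i], "send", j)
                # Roll j back while it has received more from i than i sent.
                while count(log_j, cut[j], "recv", i) > sent:
                    cut[j] -= 1
                    changed = True
    return cut


# Example: p fails and restarts from a checkpoint taken after its first send.
logs = {
    "p": [("send", "q"), ("send", "q")],
    "q": [("recv", "p"), ("recv", "p"), ("send", "r")],
    "r": [("recv", "q")],
}
start = {"p": 1, "q": 3, "r": 1}
result = max_consistent_cut(logs, start)
```

In this example the rollback cascades: q must undo its second receive from p, which also undoes its send to r, so r must undo its receive from q. The loop only ever decreases prefix lengths, so it terminates, and it stops at the largest prefixes that satisfy the count condition.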
© 1990 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Juang, T.TY., Venkatesan, S. (1990). Efficient algorithms for crash recovery in distributed systems. In: Nori, K.V., Veni Madhavan, C.E. (eds) Foundations of Software Technology and Theoretical Computer Science. FSTTCS 1990. Lecture Notes in Computer Science, vol 472. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-53487-3_56
DOI: https://doi.org/10.1007/3-540-53487-3_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-53487-7
Online ISBN: 978-3-540-46313-9
eBook Packages: Springer Book Archive