Basic concepts and issues in fault-tolerant distributed systems

  • Conference paper
Operating Systems of the 90s and Beyond

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 563)

Abstract

The dependability of computing services will become increasingly important in the 90s and beyond. This paper proposes a small number of basic concepts that can be used to explain the architecture of present and future fault-tolerant distributed systems, and discusses a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue, we present known solutions and design alternatives, discuss their relative merits, and give examples of systems that adopt one approach or the other. The aim is to introduce some order into the complex discipline of designing and understanding fault-tolerant distributed systems.

Editor information

Arthur Karshmer, Jürgen Nehmer

Copyright information

© 1991 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cristian, F. (1991). Basic concepts and issues in fault-tolerant distributed systems. In: Karshmer, A., Nehmer, J. (eds) Operating Systems of the 90s and Beyond. Lecture Notes in Computer Science, vol 563. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0024534

  • DOI: https://doi.org/10.1007/BFb0024534

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-54987-1

  • Online ISBN: 978-3-540-46630-7

  • eBook Packages: Springer Book Archive
