Basic concepts and issues in fault-tolerant distributed systems

Cristian, Flaviu

doi:10.1007/BFb0024534

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 563)

Abstract

The dependability of computing services will become increasingly important in the 90s and beyond. This paper proposes a small number of basic concepts that can be used to explain the architecture of present and future fault-tolerant distributed systems, and discusses a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, discuss their relative merits, and give examples of systems that adopt one approach or the other. The aim is to introduce some order into the complex discipline of designing and understanding fault-tolerant distributed systems.

Author information

Authors and Affiliations

IBM Almaden Research Center, San Jose, CA 95120-6099
Flaviu Cristian

Editor information

Arthur Karshmer, Jürgen Nehmer

About this paper

Cite this paper

Cristian, F. (1991). Basic concepts and issues in fault-tolerant distributed systems. In: Karshmer, A., Nehmer, J. (eds) Operating Systems of the 90s and Beyond. Lecture Notes in Computer Science, vol 563. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0024534

DOI: https://doi.org/10.1007/BFb0024534
Published: 11 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-54987-1
Online ISBN: 978-3-540-46630-7
eBook Packages: Springer Book Archive
