Abstract
The dependability of computing services will become increasingly important in the 90s and beyond. This paper proposes a small number of basic concepts that can be used to explain the architecture of present and future fault-tolerant distributed systems, and discusses a list of architectural issues that we find useful to consider when designing or examining such systems. For each issue we present known solutions and design alternatives, discuss their relative merits, and give examples of systems that adopt one approach or the other. The aim is to introduce some order into the complex discipline of designing and understanding fault-tolerant distributed systems.
Copyright information
© 1991 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cristian, F. (1991). Basic concepts and issues in fault-tolerant distributed systems. In: Karshmer, A., Nehmer, J. (eds) Operating Systems of the 90s and Beyond. Lecture Notes in Computer Science, vol 563. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0024534
DOI: https://doi.org/10.1007/BFb0024534
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-54987-1
Online ISBN: 978-3-540-46630-7
eBook Packages: Springer Book Archive