Automatic reconfiguration in the presence of failures
- Author(s): Flaviu Cristian
- DOI: 10.1049/sej.1993.0009
- Affiliation: Computer Science and Engineering Department, University of California at San Diego, La Jolla, USA
- Source: Volume 8, Issue 2, March 1993, p. 53–60
- Print ISSN 0268-6961, Online ISSN 2053-910X
We describe a new kind of distributed system service, the availability management service, responsible for ensuring that the critical services of a distributed system remain continuously available to users despite arbitrary numbers of concurrent node removals and node restarts caused by failures, maintenance, and growth. We stress the main ideas behind this new service, and outline a simple design that depends on the existence of synchronous membership and atomic broadcast group communication services. Extensions of this initial design to deal with asynchronous group communication services are also briefly discussed.
Inspec keywords: software reliability; distributed processing
Subjects: Software engineering techniques; Distributed systems software