A Formal Model for Fault-Tolerance in Distributed Systems

Hamid, Brahim; Mosbah, Mohamed

doi:10.1007/11563228_9

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3688)

Included in the following conference series:

International Conference on Computer Safety, Reliability, and Security

Abstract

We present a formal method, based on graph rewriting systems, for specifying fault-tolerant distributed algorithms and proving them correct. The method deals with crash failures: in a crash-failure system, a process can fail by crashing, i.e., by permanently halting. The faulty processes are those contaminated by the crashes. The methodology is formalized in two phases. In the first phase, we build the set of illegitimate configurations, which specifies the faults and the faulty processes. In the second phase, we add correction rules to the initial graph rewriting system that encodes the distributed algorithm. These rules detect and eliminate the faults locally during the computation. The method can be implemented on an asynchronous message-passing system that provides fault notification. To illustrate the approach, we present examples of fault-tolerant distributed spanning tree algorithms.
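
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the rule set from the paper: the labels "A" (attached to the tree) and "N" (neutral), the function names, the choice of illegitimate configuration (an attached node whose parent has crashed or been reset), and the sequential scheduler that gives correction rules priority over the growth rule are all assumptions made for illustration. The set `crashed` stands in for the fault notification the abstract mentions, and the root is assumed never to crash.

```python
# Illustrative sketch only: labels, rule names, and scheduling are assumptions,
# not the graph rewriting rules defined in the paper.

def initial_state(graph, root):
    """Label every node neutral ("N") except the root, which starts attached ("A")."""
    label = {v: "N" for v in graph}
    parent = {v: None for v in graph}
    label[root] = "A"
    return label, parent

def correction_step(graph, label, parent, crashed):
    """Apply one correction rule if a live node is in an illegitimate configuration.

    Assumed illegitimate configuration: a live attached node whose parent
    has crashed or has itself been reset to "N".
    """
    for v in graph:
        if v in crashed or label[v] != "A" or parent[v] is None:
            continue  # skip crashed nodes, neutral nodes, and the root
        p = parent[v]
        if p in crashed or label[p] == "N":
            label[v], parent[v] = "N", None  # detach v so it can rejoin later
            return True
    return False

def growth_step(graph, label, parent, crashed):
    """Apply the tree-growth rule once: an attached node adopts a neutral neighbour."""
    for u in graph:
        if u in crashed or label[u] != "A":
            continue
        for v in graph[u]:
            if v not in crashed and label[v] == "N":
                label[v], parent[v] = "A", u
                return True
    return False

def stabilize(graph, label, parent, crashed=frozenset()):
    """Apply rules until none applies; corrections are tried before growth."""
    while (correction_step(graph, label, parent, crashed)
           or growth_step(graph, label, parent, crashed)):
        pass
    return parent

# Hypothetical five-node network given as adjacency lists.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
label, parent = initial_state(graph, root=0)
stabilize(graph, label, parent)               # build the initial spanning tree
stabilize(graph, label, parent, crashed={1})  # node 1 crashes; the tree is repaired
print({v: p for v, p in parent.items() if v != 1})
```

In this sketch, nodes 3 and 4 are first detached because their path to the root passed through the crashed node, and the growth rule then reattaches them through node 2. Giving corrections priority is a simplification of the sequential simulation; a truly asynchronous execution needs more care to keep a detached node from reattaching to one of its own stale descendants.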

Author information

Authors and Affiliations

LaBRI, ENSEIRB, University of Bordeaux-1, F-33405 Cedex, Talence, France
Brahim Hamid & Mohamed Mosbah

Editor information

Editors and Affiliations

Faculty of Computer Sciences, Østfold University College, Os Allé, 11, 1757, Halden, Norway
Rune Winther
Institute for Energy Technology, Software Engineering Laboratory, 1761, Halden, Norway
Bjørn Axel Gran
Software Engineering Laboratory, Institute for Energy Technology, 1751, Halden, Norway
Gustav Dahll

About this paper

Cite this paper

Hamid, B., Mosbah, M. (2005). A Formal Model for Fault-Tolerance in Distributed Systems. In: Winther, R., Gran, B.A., Dahll, G. (eds) Computer Safety, Reliability, and Security. SAFECOMP 2005. Lecture Notes in Computer Science, vol 3688. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563228_9

DOI: https://doi.org/10.1007/11563228_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29200-5
Online ISBN: 978-3-540-32000-5
eBook Packages: Computer Science, Computer Science (R0)