Failure Detection with Booting in Partially Synchronous Systems

Widder, Josef; Le Lann, Gérard; Schmid, Ulrich

doi:10.1007/11408901_3

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 3463)

Included in the following conference series:

European Dependable Computing Conference

Abstract

Unreliable failure detectors are a well-known means of enriching asynchronous distributed systems with time-free semantics that allow consensus to be solved in the presence of crash failures. Implementing unreliable failure detectors requires a system that provides some synchrony, typically an upper bound on end-to-end message delays. Recently, we introduced an implementation of the perfect failure detector in a novel partially synchronous model, referred to as the Θ-Model. In this model, only the ratio Θ of the maximum to the minimum end-to-end delay of messages that are simultaneously in transit must be known a priori; the actual delays need not be known, or even bounded. In this paper, we present an alternative failure detector algorithm based on a clock synchronization algorithm for the Θ-Model. It not only outperforms our first implementation with respect to failure detection time, but also works during the system booting phase.
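The central assumption of the Θ-Model, as stated in the abstract, constrains only the ratio between the delays of messages that are in transit at the same time. It does not constrain the delays themselves. The Python sketch below checks that condition on a recorded trace. It assumes a global observer that knows each message's real-time send and receive instants; individual processes in the model have no such view. The names `Message`, `in_transit`, `max_delay_ratio`, and `satisfies_theta` are hypothetical. The sketch illustrates the assumption only and is not the failure detector or clock synchronization algorithm from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    send_time: float  # real time of sending (assumed known to a global observer)
    recv_time: float  # real time of delivery; assumed strictly greater than send_time

    @property
    def delay(self) -> float:
        return self.recv_time - self.send_time


def in_transit(messages, t):
    """Messages in transit at real time t, using half-open intervals [send, recv)."""
    return [m for m in messages if m.send_time <= t < m.recv_time]


def max_delay_ratio(messages):
    """Largest max/min delay ratio over all sets of simultaneously in-transit messages.

    The in-transit set only grows at send events, and every set between two
    consecutive send events is a subset of the set at the earlier one, so
    checking at send times suffices. Each such set is non-empty because the
    message sent at t is itself in transit at t.
    """
    worst = 1.0
    for t in sorted({m.send_time for m in messages}):
        delays = [m.delay for m in in_transit(messages, t)]
        worst = max(worst, max(delays) / min(delays))
    return worst


def satisfies_theta(messages, theta):
    """True if the trace respects the (hypothetical) bound theta on the delay ratio."""
    return max_delay_ratio(messages) <= theta


if __name__ == "__main__":
    trace = [Message(0.0, 2.0), Message(1.0, 2.5), Message(3.0, 9.0)]
    print(max_delay_ratio(trace))       # 1.3333333333333333
    print(satisfies_theta(trace, 1.5))  # True
```

In this trace, the third message's delay of 6.0 is four times the second message's delay. That difference does not matter, because the two messages are never in transit simultaneously. Only the overlapping pair with delays 2.0 and 1.5 determines the ratio. This is why the Θ-Model tolerates unbounded delays as long as concurrent messages slow down together.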

Supported by the Austrian START program Y41-MAT, the BM:vit FIT-IT project DCBA (proj. no. 808198), and by the FWF project Theta (proj. no. P17757-N04).

Author information

Authors and Affiliations

Embedded Computing Systems Group E182/2, Technische Universität Wien, Treitlstraße 3, A-1040, Vienna, Austria
Josef Widder & Ulrich Schmid
INRIA Rocquencourt, Projet Novaltis, BP 105, F-78153, Le Chesnay Cedex, France
Gérard Le Lann

Editor information

Editors and Affiliations

Institute for Computer Sciences III, University of Erlangen-Nürnberg, Martensstr. 3, 91058, Erlangen, Germany
Mario Dal Cin
UPS, INSA, INP, ISAE; LAAS-CNRS, Université de Toulouse, Toulouse, France
Mohamed Kaâniche
Department of Measurement and Information Systems, Budapest University of Technology and Economics,
András Pataricza

About this paper

Cite this paper

Widder, J., Le Lann, G., Schmid, U. (2005). Failure Detection with Booting in Partially Synchronous Systems. In: Dal Cin, M., Kaâniche, M., Pataricza, A. (eds) Dependable Computing - EDCC 5. EDCC 2005. Lecture Notes in Computer Science, vol 3463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408901_3

DOI: https://doi.org/10.1007/11408901_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25723-3
Online ISBN: 978-3-540-32019-7
eBook Packages: Computer Science, Computer Science (R0)
