Simulating crash failures with many faulty processors (extended abstract)

Bazzi, Rida; Neiger, Gil

doi:10.1007/3-540-56188-9_12

Conference paper
First Online: 01 January 2005

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 647)

Abstract

The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.
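As a rough illustration of the two failure models contrasted above, and of how round-complexity scales running time, the following Python sketch may help. It is not taken from the paper. The names `Processor`, `delivered`, and `translated_rounds` are hypothetical, a crash is simplified to "sends nothing from the crash round onward", and omission failures are modeled as send-omissions only.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class Processor:
    """Hypothetical processor in a synchronous, round-based system."""
    pid: int
    # Crash model (simplified): silent from this round onward, forever.
    crashed_at: Optional[int] = None
    # Omission model (assumed send-omission): individual (round, dest)
    # messages that are dropped; the processor otherwise keeps running.
    omit: Set[Tuple[int, int]] = field(default_factory=set)

    def sends(self, rnd: int, dest: int) -> bool:
        if self.crashed_at is not None and rnd >= self.crashed_at:
            return False
        return (rnd, dest) not in self.omit


def delivered(procs, rounds):
    """Set of (round, src, dest) messages that actually arrive."""
    return {
        (r, p.pid, q.pid)
        for r in range(1, rounds + 1)
        for p in procs
        for q in procs
        if p.pid != q.pid and p.sends(r, q.pid)
    }


def translated_rounds(original_rounds: int, rounds_per_simulated_round: int) -> int:
    """Rounds used by a translated algorithm, given the translation's
    round-complexity (rounds spent simulating one original round)."""
    return original_rounds * rounds_per_simulated_round


# Processor 0 crashes at round 2: nothing from it is seen afterwards.
crash_run = delivered([Processor(0, crashed_at=2), Processor(1), Processor(2)], 3)

# Processor 0 omits one message in round 2 but keeps sending in round 3.
omission_run = delivered([Processor(0, omit={(2, 1)}), Processor(1), Processor(2)], 3)
```

In the crash run, processor 0 is silent once it fails, so other processors see a clean stop. In the omission run it drops a message and then resumes sending, which is the behavior a translation has to mask so that an algorithm written for crash failures still works.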

Partial support for this work was provided by the National Science Foundation under grants CCR-8909663 and CCR-9106627.

This author was supported in part by a scholarship from the Hariri Foundation.

Author information

Authors and Affiliations

College of Computing, Georgia Institute of Technology, 30332-0280, Atlanta, Georgia, USA
Rida Bazzi & Gil Neiger

Editor information

Adrian Segall, Shmuel Zaks

About this paper

Cite this paper

Bazzi, R., Neiger, G. (1992). Simulating crash failures with many faulty processors (extended abstract). In: Segall, A., Zaks, S. (eds) Distributed Algorithms. WDAG 1992. Lecture Notes in Computer Science, vol 647. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56188-9_12

DOI: https://doi.org/10.1007/3-540-56188-9_12
Published: 04 June 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56188-0
Online ISBN: 978-3-540-47484-5
eBook Packages: Springer Book Archive
