Abstract
The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.
Partial support for this work was provided by the National Science Foundation under grants CCR-8909663 and CCR-9106627.
This author was supported in part by a scholarship from the Hariri Foundation.
© 1992 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bazzi, R., Neiger, G. (1992). Simulating crash failures with many faulty processors (extended abstract). In: Segall, A., Zaks, S. (eds) Distributed Algorithms. WDAG 1992. Lecture Notes in Computer Science, vol 647. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56188-9_12
DOI: https://doi.org/10.1007/3-540-56188-9_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-56188-0
Online ISBN: 978-3-540-47484-5
eBook Packages: Springer Book Archive