Simulating crash failures with many faulty processors (extended abstract)

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCS, volume 647)

Abstract

The difficulty of designing fault-tolerant distributed algorithms increases with the severity of failures that an algorithm must tolerate. This paper considers methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe omission failures. These translations simplify the design task by allowing algorithm designers to assume that processors fail only by stopping. Earlier results had suggested that these translations must, in general, have limited fault-tolerance: that crash failures could not be simulated unless a majority of processors remained correct throughout any execution. We show that this limitation does not apply when considering a broad range of distributed computing problems that includes most classical problems in the field. We do this by exhibiting a hierarchy of translations, each with different fault-tolerance and complexity; for any number of possible failures, we give an appropriate translation. Each of these translations is shown to be optimal with respect to the joint measures of fault-tolerance and round-complexity (the round-complexity of a translation is the number of communication rounds that the translation uses to simulate one round of the original algorithm). That is, the hierarchy of translations is matched by a corresponding hierarchy of impossibility results. Furthermore, this hierarchy has more structure than that seen for other failure models, indicating that the relationship between crash and omission failures is more complex than had been previously thought.
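As a rough illustration of the terms used above (not the paper's construction), the following Python sketch contrasts the two failure models in a synchronous round-based setting and shows how round-complexity scales an algorithm's running time. The names `Processor`, `delivers`, and `simulated_rounds` are hypothetical, crash behaviour is simplified so that a processor is silent from its crash round onward, and omission is modelled as send-omission only.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple


@dataclass
class Processor:
    """Hypothetical model of one processor in a synchronous, round-based system."""
    pid: int
    # Crash failure: the round from which the processor stops sending (None = never).
    crashed_at: Optional[int] = None
    # Omission failure (send-omission assumed): (round, destination) pairs
    # whose messages are silently dropped while the processor keeps running.
    omits: Set[Tuple[int, int]] = field(default_factory=set)

    def delivers(self, rnd: int, dest: int) -> bool:
        """Return True if this processor's round-`rnd` message reaches `dest`."""
        if self.crashed_at is not None and rnd >= self.crashed_at:
            return False  # crashed: nothing is sent from the crash round onward
        return (rnd, dest) not in self.omits  # omission: only selected messages are lost


def simulated_rounds(original_rounds: int, round_complexity: int) -> int:
    """Rounds used by a translated algorithm: each original round costs
    `round_complexity` communication rounds."""
    return original_rounds * round_complexity
```

Under these assumptions, a crashed processor falls silent permanently, whereas an omission-faulty one can drop a message in one round and send normally in the next, which is what makes omission failures harder to tolerate. A translation with round-complexity 2 (a value chosen only for illustration) would turn a 5-round crash-tolerant algorithm into a 10-round omission-tolerant one.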

Partial support for this work was provided by the National Science Foundation under grants CCR-8909663 and CCR-9106627.

This author was supported in part by a scholarship from the Hariri Foundation.

Editor information

Adrian Segall, Shmuel Zaks

Copyright information

© 1992 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bazzi, R., Neiger, G. (1992). Simulating crash failures with many faulty processors (extended abstract). In: Segall, A., Zaks, S. (eds) Distributed Algorithms. WDAG 1992. Lecture Notes in Computer Science, vol 647. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-56188-9_12

  • DOI: https://doi.org/10.1007/3-540-56188-9_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-56188-0

  • Online ISBN: 978-3-540-47484-5

  • eBook Packages: Springer Book Archive
