
A Closer Look at Fault Tolerance

Theory of Computing Systems

Abstract

The traditional notion of fault tolerance requires that all the correct participating processes eventually terminate, and is therefore insensitive to the number of correct processes that should terminate when failures occur. Intuitively, an algorithm that, in the presence of any number of faults, always guarantees that all the correct processes except maybe one terminate is more resilient to faults than an algorithm that, in the presence of a single fault, does not even guarantee that a single correct process ever terminates. However, according to the standard notion of fault tolerance, both algorithms are classified as algorithms that cannot tolerate a single fault. To overcome this difficulty, we generalize the traditional notion of fault tolerance in a way that captures more sensitive information about the resiliency of an algorithm. Then, we present several algorithms for solving classical problems which are resilient under the new notion. It is well known that, in asynchronous systems where processes communicate either by reading and writing atomic registers or by sending and receiving messages, important problems such as consensus, set-consensus, election, perfect renaming, and implementations of a test-and-set bit, a shared stack, a swap object and a fetch-and-add object have no deterministic solutions which can tolerate even a single fault. We show that, while some of these problems have solutions which guarantee that in the presence of any number of faults most of the correct processes will terminate, other problems do not even have solutions which guarantee that in the presence of just one fault at least one correct process terminates. All our results are presented in the context of crash failures in asynchronous systems.

Notes

  1. A set of processes P is maximal with respect to property ϕ if (1) P satisfies ϕ, and (2) there is no set Q such that P ⊂ Q and Q satisfies ϕ.

Acknowledgments

I wish to thank the three anonymous referees for their constructive suggestions and corrections.

Author information

Corresponding author

Correspondence to Gadi Taubenfeld.

Additional information

A preliminary version of the results presented in this paper appeared in the proceedings of the 31st Annual Symposium on Principles of Distributed Computing (PODC 2012), Madeira, Portugal, July 2012.

Cite this article

Taubenfeld, G. A Closer Look at Fault Tolerance. Theory Comput Syst 62, 1085–1108 (2018). https://doi.org/10.1007/s00224-017-9779-4
