A Closer Look at Fault Tolerance

Taubenfeld, Gadi

doi:10.1007/s00224-017-9779-4

Published: 15 May 2017

Theory of Computing Systems
Volume 62, pages 1085–1108 (2018)

Gadi Taubenfeld¹

Abstract

The traditional notion of fault tolerance requires that all the correct participating processes eventually terminate, and thus is not sensitive to the number of correct processes that should terminate as a result of failures. Intuitively, an algorithm that, in the presence of any number of faults, always guarantees that all the correct processes except maybe one terminate is more resilient to faults than an algorithm that, in the presence of a single fault, does not even guarantee that a single correct process ever terminates. However, according to the standard notion of fault tolerance, both algorithms are classified as algorithms that cannot tolerate a single fault. To overcome this difficulty, we generalize the traditional notion of fault tolerance in a way that captures more sensitive information about the resiliency of an algorithm. Then, we present several algorithms for solving classical problems which are resilient under the new notion. It is well known that, in asynchronous systems where processes communicate either by reading and writing atomic registers or by sending and receiving messages, important problems such as consensus, set-consensus, election, perfect renaming, and implementations of a test-and-set bit, a shared stack, a swap object, and a fetch-and-add object have no deterministic solutions which can tolerate even a single fault. We show that, while some of these problems have solutions which guarantee that in the presence of any number of faults most of the correct processes will terminate, other problems do not even have solutions which guarantee that in the presence of just one fault at least one correct process terminates. All our results are presented in the context of crash failures in asynchronous systems.
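The contrast drawn in the abstract can be made concrete with a small sketch. It is an illustration only and not the paper's formal definition: the `RunOutcome` record and the parameters `max_faults` and `exceptions` are hypothetical names, and a run is assumed to be summarized after the fact by which processes crashed and which terminated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunOutcome:
    """Hypothetical summary of one finished run (names are assumptions)."""
    participants: frozenset  # processes that took part in the run
    crashed: frozenset       # processes that suffered a crash failure
    terminated: frozenset    # processes that finished the algorithm

    @property
    def correct(self) -> frozenset:
        # Correct processes are the participants that did not crash.
        return self.participants - self.crashed


def satisfies_traditional(run: RunOutcome, max_faults: int) -> bool:
    """Traditional requirement: with at most max_faults crashes,
    every correct participating process terminates."""
    if len(run.crashed) > max_faults:
        return True  # more faults than tolerated: nothing is required
    return run.correct <= run.terminated


def satisfies_relaxed(run: RunOutcome, max_faults: int, exceptions: int) -> bool:
    """Relaxed requirement: with at most max_faults crashes, all correct
    processes terminate except maybe `exceptions` of them."""
    if len(run.crashed) > max_faults:
        return True  # more faults than tolerated: nothing is required
    stuck = run.correct - run.terminated
    return len(stuck) <= exceptions


# Five processes, two crash, and one correct process (5) never terminates.
run = RunOutcome(
    participants=frozenset({1, 2, 3, 4, 5}),
    crashed=frozenset({2, 4}),
    terminated=frozenset({1, 3}),
)
traditional_ok = satisfies_traditional(run, max_faults=2)
relaxed_ok = satisfies_relaxed(run, max_faults=2, exceptions=1)
```

For this run the traditional check fails, because process 5 is correct but stuck, while the relaxed check with one permitted exception passes. Setting `exceptions=0` makes the two checks coincide, which is why the relaxed form can be read as a generalization.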

Notes

A set of processes P is maximal with respect to property ϕ if (1) P satisfies ϕ, and (2) there is no set Q such that P ⊂ Q and Q satisfies ϕ.
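The definition in this note can be checked by brute force over a small finite universe. The sketch below assumes the property is given as a Python predicate on sets. The function name `is_maximal`, the `universe` argument, and the sample property are all hypothetical and serve only as illustration.

```python
from itertools import combinations


def is_maximal(p, universe, phi) -> bool:
    """Return True if p satisfies phi and no strict superset of p
    within universe satisfies phi."""
    p = frozenset(p)
    if not phi(p):
        return False  # condition (1) fails
    rest = list(frozenset(universe) - p)
    # Enumerate every strict superset Q of p by adding at least one element.
    for k in range(1, len(rest) + 1):
        for extra in combinations(rest, k):
            if phi(p | frozenset(extra)):
                return False  # condition (2) fails: a larger set satisfies phi
    return True


# Sample property (an assumption): the identifiers in the set sum to at most 6.
def small_sum(s) -> bool:
    return sum(s) <= 6


universe = {1, 2, 3, 4}
maximal_24 = is_maximal({2, 4}, universe, small_sum)
maximal_12 = is_maximal({1, 2}, universe, small_sum)
```

Here `{2, 4}` is maximal because adding 1 or 3 pushes the sum above 6, while `{1, 2}` is not, since the strict superset `{1, 2, 3}` still satisfies the property. The enumeration is exponential in the size of the universe, so it is only suitable for small examples.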

Acknowledgments

I wish to thank the three anonymous referees for their constructive suggestions and corrections.

Author information

Authors and Affiliations

The Interdisciplinary Center, P.O.Box 167, Herzliya, 46150, Israel
Gadi Taubenfeld

Corresponding author

Correspondence to Gadi Taubenfeld.

Additional information

A preliminary version of the results presented in this paper appeared in the proceedings of the 31st Annual Symposium on Principles of Distributed Computing (PODC 2012), Madeira, Portugal, July 2012.

About this article

Cite this article

Taubenfeld, G. A Closer Look at Fault Tolerance. Theory Comput Syst 62, 1085–1108 (2018). https://doi.org/10.1007/s00224-017-9779-4

Published: 15 May 2017
Issue Date: July 2018
DOI: https://doi.org/10.1007/s00224-017-9779-4
