Resilience of mutual exclusion algorithms to transient memory faults

Authors:
Thomas Moscibroda, Microsoft Research, Redmond, WA, USA
Rotem Oshman, MIT, Cambridge, MA, USA

PODC '11: Proceedings of the 30th annual ACM SIGACT-SIGOPS symposium on Principles of distributed computing, June 2011, Pages 69–78. https://doi.org/10.1145/1993806.1993817

Published: 06 June 2011

ABSTRACT

We study the behavior of mutual exclusion algorithms in the presence of unreliable shared memory subject to transient memory faults. It is well known that classical 2-process mutual exclusion algorithms, such as Dekker's and Peterson's algorithms, are not fault-tolerant; in this paper we ask what degree of fault tolerance can be achieved using the same restricted resources as Dekker's and Peterson's algorithms, namely, three binary read/write registers.
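To make the fragility concrete, the following is a minimal sketch of the textbook Peterson algorithm (not the algorithm proposed in the paper) running over three binary registers, `flag[0]`, `flag[1]`, and `turn`. The names `step` and `simulate`, the program-counter encoding, and the fault model (a single bit flip of one `flag` register injected before a chosen step) are assumptions made purely for illustration. Reading both registers in one step of the wait loop is a simplification.

```python
def step(mem, pc, i):
    """Execute one step of process i and return its new program counter."""
    j = 1 - i
    if pc == 0:            # flag[i] := 1  (announce interest)
        mem["flag"][i] = 1
        return 1
    if pc == 1:            # turn := j     (yield priority to the other process)
        mem["turn"] = j
        return 2
    if pc == 2:            # wait while flag[j] == 1 and turn == j
        busy = mem["flag"][j] == 1 and mem["turn"] == j
        return 2 if busy else 3
    return pc              # pc == 3: inside the critical section, stays there


def simulate(schedule, fault=None):
    """Run the given schedule of process ids.

    fault is an optional pair (k, i): flip flag[i] just before step k.
    Returns True if both processes are ever in the critical section at once.
    """
    mem = {"flag": [0, 0], "turn": 0}
    pc = [0, 0]
    for k, i in enumerate(schedule):
        if fault is not None and fault[0] == k:
            mem["flag"][fault[1]] ^= 1   # transient fault: one bit flips
        pc[i] = step(mem, pc[i], i)
        if pc == [3, 3]:
            return True
    return False


schedule = [0, 0, 0, 1, 1, 1]              # process 0 enters first, then process 1 tries
print(simulate(schedule))                  # False: process 1 waits
print(simulate(schedule, fault=(3, 0)))    # True: flag[0] is flipped to 0 while process 0 is inside
```

In the faulty run, process 0 is already in its critical section when its `flag` bit is corrupted, so process 1 sees no contention and enters as well. A single transient fault is enough to violate mutual exclusion in this sketch.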

We show that if one memory fault can occur, it is not possible to guarantee both mutual exclusion and deadlock-freedom using three binary registers; this holds in general when fewer than 2f+1 binary registers are used and f of them may be faulty. Hence we focus on algorithms that guarantee (a) mutual exclusion and starvation-freedom in fault-free executions, and (b) only mutual exclusion in faulty executions. We show that using only three binary registers it is possible to design a 2-process mutual exclusion algorithm which tolerates a single memory fault in this manner. Further, by replacing one read/write register with a test&set register, we can guarantee mutual exclusion in executions where one variable experiences unboundedly many faults.

In the more general setting where up to f registers may be faulty, we show that it is not possible to guarantee mutual exclusion using 2f + 1 binary read/write registers if each faulty register can exhibit unboundedly many faults. On the positive side, we show that an n-variable single-fault tolerant algorithm satisfying certain conditions can be transformed into an ((n-1)f + 1)-variable f-fault tolerant algorithm with the same progress guarantee as the original. In combination with our three-variable algorithm, this implies that there is a (2f+1)-variable mutual exclusion algorithm tolerating a single fault in up to f variables without violating mutual exclusion.
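The register count in the last sentence follows directly from the transformation stated above. As a short worked check, substituting the three-variable algorithm (n = 3) into the ((n-1)f + 1) bound gives:

```latex
\[
(n-1)f + 1 \;=\; (3-1)f + 1 \;=\; 2f + 1 ,
\qquad \text{so } f = 1 \text{ gives } 3 \text{ registers and } f = 2 \text{ gives } 5 .
\]
```

For f = 1 this recovers the three binary registers of the single-fault algorithm.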

Index Terms

Resilience of mutual exclusion algorithms to transient memory faults
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Mutual exclusion
    2. Extra-functional properties
      1. Software fault tolerance

Published in
PODC '11: Proceedings of the 30th annual ACM SIGACT-SIGOPS symposium on Principles of distributed computing
June 2011
406 pages
ISBN:9781450307192
DOI:10.1145/1993806
General Chair: Cyril Gavoille, LaBRI, University of Bordeaux, France
Program Chair: Pierre Fraigniaud, CNRS and University of Paris Diderot, France
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
fault tolerance
mutual exclusion
transient memory faults
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 740 of 2,477 submissions, 30%