ABSTRACT
We study the behavior of mutual exclusion algorithms in the presence of unreliable shared memory subject to transient memory faults. It is well known that classical 2-process mutual exclusion algorithms, such as Dekker's and Peterson's algorithms, are not fault-tolerant; in this paper we ask what degree of fault tolerance can be achieved using the same restricted resources as Dekker's and Peterson's algorithms, namely, three binary read/write registers.
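To illustrate why a classical algorithm is not fault-tolerant, the following is a minimal sketch, not taken from the paper, that exhaustively explores a small model of Peterson's algorithm over three binary registers (flag[0], flag[1], turn). Several things here are assumptions made for illustration: a transient fault is modelled as a single bit flip of one register, the wait condition is evaluated in one atomic step, and the names `step`, `violates_mutex`, and `CS` are hypothetical.

```python
from collections import deque

CS = 3  # program-counter value meaning "inside the critical section"


def step(state, i):
    """Return the state after process i takes one step of Peterson's algorithm."""
    pcs, regs, faults = state
    pcs, regs = list(pcs), list(regs)  # regs = [flag0, flag1, turn]
    j = 1 - i
    pc = pcs[i]
    if pc == 0:      # flag[i] := 1
        regs[i] = 1
        pcs[i] = 1
    elif pc == 1:    # turn := j
        regs[2] = j
        pcs[i] = 2
    elif pc == 2:    # wait until flag[j] == 0 or turn == i (atomic: an assumption)
        if regs[j] == 0 or regs[2] == i:
            pcs[i] = CS
    else:            # leave the critical section: flag[i] := 0
        regs[i] = 0
        pcs[i] = 0
    return (tuple(pcs), tuple(regs), faults)


def violates_mutex(max_faults):
    """Breadth-first search for a reachable state with both processes in CS."""
    start = ((0, 0), (0, 0, 0), max_faults)
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        pcs, regs, faults = state
        if pcs == (CS, CS):
            return True
        successors = [step(state, 0), step(state, 1)]
        if faults > 0:
            # Assumed fault model: flip the bit held in any one register.
            for r in range(3):
                flipped = list(regs)
                flipped[r] ^= 1
                successors.append((pcs, tuple(flipped), faults - 1))
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


if __name__ == "__main__":
    print("fault-free violation reachable:", violates_mutex(0))
    print("one-fault violation reachable:", violates_mutex(1))
```

In this model no violation is reachable without faults, while a single flip of turn (or of a flag) while one process is in its critical section lets the other process enter as well.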
We show that if one memory fault can occur, it is not possible to guarantee both mutual exclusion and deadlock-freedom using three binary registers; this holds in general when fewer than 2f+1 binary registers are used and f of them may be faulty. Hence we focus on algorithms that guarantee (a) mutual exclusion and starvation-freedom in fault-free executions, and (b) only mutual exclusion in faulty executions. We show that using only three binary registers it is possible to design a 2-process mutual exclusion algorithm which tolerates a single memory fault in this manner. Further, by replacing one read/write register with a test&set register, we can guarantee mutual exclusion in executions where one variable experiences unboundedly many faults.
In the more general setting where up to f registers may be faulty, we show that it is not possible to guarantee mutual exclusion using 2f+1 binary read/write registers if each faulty register can exhibit unboundedly many faults. On the positive side, we show that an n-variable single-fault-tolerant algorithm satisfying certain conditions can be transformed into an ((n-1)f+1)-variable f-fault-tolerant algorithm with the same progress guarantee as the original. In combination with our three-variable algorithm, this implies that there is a (2f+1)-variable mutual exclusion algorithm tolerating a single fault in up to f variables without violating mutual exclusion.
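The register count in the last claim follows by instantiating the transformation with the three-variable algorithm; the short calculation below is only a worked check of that arithmetic, using the symbols n and f from the text:

```latex
\[
  (n-1)f + 1 \,\Big|_{n=3} \;=\; (3-1)f + 1 \;=\; 2f + 1 .
\]
```

For example, f = 1 gives back the three registers of the original algorithm, and f = 2 gives five.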
Index Terms
- Resilience of mutual exclusion algorithms to transient memory faults