## On the Time and Space Complexity of ABA Prevention and Detection<sup>\*</sup>

Zahra Aghazadeh University of Calgary Philipp Woelfel University of Calgary

We investigate the time and space complexity of detecting and preventing ABAs in shared memory algorithms for systems with n processes and bounded base objects. To that end, we define ABA-detecting registers, which are similar to normal read/write registers, except that they allow a process q to detect, with a read operation, whether some process wrote to the register since q's last read. ABA-detecting registers can be implemented trivially from a single unbounded register, but we show that they have a high complexity if base objects are bounded: An obstruction-free ABA-detecting single bit register cannot be implemented from fewer than n - 1 bounded registers. Moreover, bounded CAS objects (or more generally, conditional read-modify-write primitives) offer little help to implement ABA-detecting single bit registers: We prove a linear time-space tradeoff for such implementations. We show that the same time-space tradeoff holds for implementations of single bit LL/SC primitives from bounded writable CAS objects. This proves that the implementations of LL/SC/VL by Anderson and Moir [2] as well as Jayanti and Petrovic [15] are optimal.

We complement our lower bounds with tight upper bounds: We give an implementation of ABA-detecting registers from n + 1 bounded registers, which has step complexity O(1). We also show that (bounded) LL/SC/VL can be implemented from a single bounded CAS object and with O(n) step complexity. Both upper bounds are asymptotically optimal with respect to their time-space product.

These results give formal evidence that the ABA problem is inherently difficult, that even writable CAS objects do not provide significant benefits over registers for dealing with the ABA problem itself, and that there is no hope of finding a more efficient implementation of LL/SC/VL from bounded CAS objects and registers than the ones from [2, 15].

<sup>\*</sup>This research was undertaken, in part, thanks to funding from the Canada Research Chairs program and from the Discovery Grants program of the Natural Sciences and Engineering Research Council of Canada (NSERC).

## 1. Introduction

Since the beginning of shared memory computing, programmers and researchers have had to deal with the ABA problem: Even though a process retrieves the same value twice in a row from a shared memory object, it is still possible that the value of the object has changed multiple times.

Algorithms using the standard Compare-and-Swap (CAS) primitive seem to be especially susceptible. A CAS object provides two operations: Read() returns the value of the object, and CAS(x,y) changes the value of the object to y provided that its value v prior to the operation equals x, and it returns v. (According to some specifications a CAS(x,y) returns a Boolean, which is True if and only if the CAS() succeeded, i.e., it wrote y.) Often, CAS objects are used in the following way: First, a process p reads the value x stored in the CAS object, then it performs some computation, and finally it tries to propagate the result of the computation by performing a CAS(x,y). The idea is that if another process has already updated the data structure, p's CAS() should fail, and so inconsistencies are avoided. However, if multiple successful CAS() operations have occurred and the value of the object has changed back to x, p's CAS() might still succeed, possibly yielding inconsistencies.
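The read, compute, CAS pattern described above and its failure mode can be illustrated with a minimal sequential sketch. The class `CASObject` and the process roles `p` and `q` are hypothetical names introduced only for this illustration; no real concurrency is modeled, the interleaving is simply written out by hand.

```python
class CASObject:
    """Sequential model of a CAS object (illustrative only)."""

    def __init__(self, value):
        self._value = value

    def read(self):
        return self._value

    def cas(self, expected, new):
        # Returns the value v held prior to the operation; the CAS
        # wrote `new` exactly when v equals `expected`.
        old = self._value
        if old == expected:
            self._value = new
        return old


obj = CASObject("A")

# Process p reads x = "A" and starts some computation.
x = obj.read()

# Meanwhile, process q changes the value to "B" and back to "A".
q_first = obj.cas("A", "B")
q_second = obj.cas("B", "A")

# p's CAS succeeds although the object changed twice in between.
p_result = obj.cas(x, "C")
p_succeeded = (p_result == x)
```

In this hand-written interleaving `p_succeeded` is true: from p's point of view nothing happened, which is exactly the ABA situation.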

ABAs are also a problem for algorithms using other strong primitives, or even only registers. For example, in mutual exclusion algorithms processes often busy-wait for certain events to happen, by repeatedly reading the same register. In systems with caches, the cost of waiting is small, because as long as no process changes the register value, all reads are cache hits. The event is signaled by other processes through a change in the register value. But it may also be desirable to eventually reset the register to the state it had before the event was signaled, so that it can be reused. This may result in the ABA problem, and as a consequence waiting processes may miss events. Therefore, algorithm designers have to devise more complicated code in order to avoid unnoticed cache misses, or even lack of progress.

Many shared memory algorithms and data structures have to deal with the ABA problem. Often this is done in an ad-hoc, application specific way [31], or solutions are based on tagging [10, 19, 23–25, 27–29] (see below). Other papers combine tagging and memory management techniques, or suggest both as alternatives [10, 18].

Tagging, introduced by IBM [14], requires augmenting an object with a tag (sometimes called a sequence number) that gets changed with every write operation. This technique avoids the ABA problem only if tags never repeat. Therefore, theoretically, an infinite number of tags and thus base objects of unbounded size are required. One may argue that, in practice, for reasonably large base objects, a system will never run out of tags. However, this is unrealistic in cases where the tag has to be stored together with other information in the same object. In some cases, it is possible to store the tag in a separate object (e.g., [15]), but this requires technically difficult algorithms and tedious correctness proofs. Some architectures like the IBM System/370 [14] introduced a double-width CAS primitive, which allows one of two (32-bit) words to be used for storing tags. While using bounded tags does not completely avoid the ABA problem (because tag values may wrap around), it has been argued [24, 25, 28, 29] that an erroneous algorithm execution due to an unexpected ABA becomes very unlikely. From a theoretical perspective this is unsatisfactory. Moreover, for practical applications, it is often necessary to use the entire object space (today usually comprising 64 bits) for data, so the tagging technique requires double-width atomic instructions. Those are not supported by most mainstream architectures [20].
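The wrap-around issue with bounded tags can be made concrete with a small sequential sketch. The tag width `TAG_BITS = 2` is a deliberately tiny, hypothetical value chosen so that the wrap-around happens after a few operations; `TaggedCAS` is likewise an illustrative name and not taken from any of the cited papers.

```python
TAG_BITS = 2                 # hypothetical, deliberately tiny tag width
TAG_MOD = 1 << TAG_BITS      # tags take values 0 .. 3


class TaggedCAS:
    """Sequential model of a CAS object storing (value, bounded tag)."""

    def __init__(self, value):
        self._state = (value, 0)

    def read(self):
        return self._state

    def cas(self, expected_state, new_value):
        # Succeeds only if both value and tag match; every successful
        # CAS advances the tag modulo TAG_MOD.
        if self._state != expected_state:
            return False
        _, tag = self._state
        self._state = (new_value, (tag + 1) % TAG_MOD)
        return True


obj = TaggedCAS("A")
seen = obj.read()            # p remembers (value, tag)

# One A -> B -> A cycle by another process: the tag differs, so p can tell.
obj.cas(obj.read(), "B")
obj.cas(obj.read(), "A")
detected_after_one_cycle = (obj.read() != seen)

# A second cycle makes the 2-bit tag wrap around to its old value.
obj.cas(obj.read(), "B")
obj.cas(obj.read(), "A")
missed_after_wraparound = (obj.read() == seen)

# p's stale CAS now succeeds despite four intermediate updates.
stale_cas_succeeds = obj.cas(seen, "C")
```

With realistic tag widths the same effect needs far more intermediate operations, which is why it is argued to be unlikely in practice, but it is not excluded.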

ABAs cause problems in algorithms that use some form of memory management, where a pointer to some memory space may change its value in an ABA fashion. In this context, memory reclamation techniques based on reference counting [32], Hazard pointers [20, 21], the repeat-offender problem technique [12], or the memory reclamation technique introduced in [1] deal with the ABA problem. But those techniques are application specific.

A more methodological approach has been followed by research that showed how a load-linked store-conditional (LL/SC) object can be implemented from CAS objects and registers. Such an object provides two operations, LL() and SC(), where LL() returns the current value of the object. SC(x) may either fail and not change anything, or succeed and write the value x to the object. Specifically, an SC(x) operation by process p succeeds if and only if no other SC() operation succeeded since p's last LL(). A Boolean return value of an SC() operation indicates its success (True) or failure (False). An extended specification also allows for a VL() (verify-link) operation, which does not change the state of the object, but it returns False if a successful SC() has been performed since the calling process' last LL(), and True otherwise. LL/SC (or LL/SC/VL) objects can in almost all cases replace CAS objects in algorithms, and are an effective way of avoiding the ABA problem. Unfortunately, existing multiprocessor systems only provide weak versions of LL/SC that restrict programmers severely in how they can use the objects [26], and hence they "offer little or no help with preventing the ABA problem" [22].
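The LL/SC/VL semantics described above can be sketched as a sequential model. Two details are assumptions of this sketch rather than statements from the text: a successful SC() breaks the link of every process including the caller's own, and SC() or VL() by a process that has not yet performed an LL() returns False. The class name `LLSCVL` and the explicit process index argument are illustrative.

```python
class LLSCVL:
    """Sequential model of an LL/SC/VL object for n processes."""

    def __init__(self, value, n):
        self._value = value
        # linked[p] is True iff p has performed an LL and no SC has
        # succeeded since then.
        self._linked = [False] * n

    def ll(self, p):
        self._linked[p] = True
        return self._value

    def sc(self, p, x):
        if not self._linked[p]:
            return False
        self._value = x
        # Assumption: a successful SC breaks every link, including p's own.
        self._linked = [False] * len(self._linked)
        return True

    def vl(self, p):
        # Does not change the state of the object.
        return self._linked[p]


obj = LLSCVL("A", n=2)
p, q = 0, 1

seen = obj.ll(p)             # p load-links the value "A"

obj.ll(q)
q_sc1 = obj.sc(q, "B")       # q changes A -> B
obj.ll(q)
q_sc2 = obj.sc(q, "A")       # q changes B -> A

p_vl = obj.vl(p)             # p's link is broken
p_sc = obj.sc(p, "C")        # so p's SC fails, unlike a plain CAS("A", "C")
final = obj.ll(p)
```

In contrast to the CAS-based pattern, p's update is rejected here even though the value it would compare against is again "A".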

For that reason, a line of research has been dedicated to finding time and space efficient LL/SC implementations from CAS objects and registers [2,5,15,16,20,22,26]. While many of those solutions are wait-free and often even guarantee constant time execution of each LL() and SC() operation, they still have drawbacks: Existing implementations either require unbounded tags (e.g., [26]) and thus use unbounded CAS objects or registers, or they need at least linear space. Jayanti and Petrovic [15] and Anderson and Moir [2] presented the most space efficient implementations of an LL/SC object from bounded CAS and registers, which achieve constant step-complexity: they use only one CAS object but require  $\Theta(n)$  registers. This raises the question, whether time efficient implementations of LL/SC from a smaller number of bounded CAS objects and registers may exist. More generally, in order to understand the power and limits of shared memory primitives, it seems important to learn how much time and space is required to avoid or detect ABAs, and not to restrict this question to the implementation of LL/SC objects from CAS objects and registers.

CAS and LL/SC objects have a consensus number of infinity [11], while registers have a consensus number of one. Therefore, it is impossible to implement wait-free LL/SC from registers or other objects with a bounded consensus number. Time and space lower bounds for implementations of LL/SC objects may not necessarily imply that it is the ABA problem that is hard to solve, but such lower bounds may follow inherently from other properties of the LL/SC specification.

**Results.** To investigate the complexity of detecting or preventing ABAs, we define a natural object, the ABA-detecting register. It supports two operations, DRead() and DWrite(). Operation DWrite(x) writes value x to the register, and returns nothing. Operation DRead() by process p returns, in addition to the value of the register, a Boolean flag, which is True if and only if some process executed a DWrite() since p's last DRead() operation. We distinguish between single-writer ABA-detecting registers, where only one dedicated process is allowed to call DWrite(), and multi-writer ones that don't have this restriction.

A wait-free ABA-detecting register can be implemented from registers, and thus has consensus number 1. (Therefore, such registers are weaker with respect to wait-freedom than CAS or LL/SC.) Using a single unbounded register with an unbounded tag that gets changed whenever some process writes to it, it is trivial to obtain an ABA-detecting register with constant time complexity. But if base objects have only bounded size, the situation is completely different: For implementations of ABA-detecting registers in a system with n processes and bounded registers, we obtain a linear (in n) space lower bound, even if the implementation satisfies only nondeterministic solo-termination (the nondeterministic variant of obstruction-freedom), which is a progress condition strictly weaker than wait-freedom. The availability of CAS seems to be of little help: For wait-free implementations from CAS objects and registers we obtain a time-space tradeoff that is linear in n. The same asymptotic time-space tradeoff is obtained if the base objects support arbitrary conditional read-modify-write operations [7]. Each conditional operation can be simulated by a single operation on a writable CAS object, i.e., an object that supports a Write() operation in addition to Read() and CAS(). For that reason we state the lower bound for implementations from conditional read-modify-write operations in terms of writable CAS base objects.

**Theorem 1.** Any linearizable implementation of a single-writer 1-bit ABA-detecting register from m bounded base objects satisfies:

- (a)  $m \ge n-1$  if the base objects are bounded registers, and the implementation satisfies nondeterministic solo-termination;
- (b)  $m \ge (n-1)/t$, if the base objects are bounded CAS objects and registers, and the implementation is deterministic and wait-free with worst-case step-complexity at most t; and
- (c)  $m \ge (n-1)/(2t)$, if the base objects are bounded writable CAS objects, and the implementation is deterministic and wait-free with worst-case step-complexity at most t.
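As a numerical illustration of the three bounds (the values $n = 65$ and $t = 4$ are chosen here purely for the arithmetic and do not come from the source):

$$\text{(a)}\ m \ge n - 1 = 64, \qquad \text{(b)}\ m \ge \frac{n-1}{t} = \frac{64}{4} = 16, \qquad \text{(c)}\ m \ge \frac{n-1}{2t} = \frac{64}{8} = 8.$$

Rearranged, parts (b) and (c) bound the time-space product: $m \cdot t \ge n - 1$ and $m \cdot t \ge (n-1)/2$, respectively.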

The requirement that base objects are bounded is necessary for this lower bound, because, as mentioned earlier, an ABA-detecting register can be trivially obtained by augmenting a normal register with an unbounded tag.
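A minimal sequential sketch of this trivial construction follows. It assumes, for illustration only, that the unbounded tag is a pair (writer ID, per-writer sequence number), so that every write carries a tag that never repeats and both operations take a single step on the one shared register; the source only states that some unbounded tag is used. All names are hypothetical.

```python
class UnboundedTagRegister:
    """ABA-detecting register from one unbounded (value, tag) register."""

    def __init__(self, value, n):
        self._cell = (value, None)      # the single shared register
        self._seq = [0] * n             # local: writes issued by each process
        self._last_tag = [None] * n     # local: tag seen by last DRead

    def dwrite(self, p, x):
        # Python integers are unbounded, so the tag never repeats.
        self._seq[p] += 1
        self._cell = (x, (p, self._seq[p]))   # one shared-memory write

    def dread(self, p):
        value, tag = self._cell               # one shared-memory read
        changed = (tag != self._last_tag[p])
        self._last_tag[p] = tag
        return value, changed


reg = UnboundedTagRegister("A", n=2)
first = reg.dread(1)         # no write so far
reg.dwrite(0, "B")
reg.dwrite(0, "A")           # value is back to "A"
second = reg.dread(1)        # the ABA is detected through the tag
third = reg.dread(1)         # no write since the previous DRead
```

The construction depends on the tag never repeating, which is precisely what bounded base objects cannot guarantee.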

There is a simple implementation of a (bounded) ABA-detecting register with constant step-complexity from a single (bounded) LL/SC/VL object of the same size: Each process uses a local variable *old*. To DWrite(x), the process executes an LL() operation followed by an SC(x). To DRead(), the process first executes a VL(). If VL() returns True, the process returns (*old*, False); otherwise, it reads the value of the LL/SC/VL object into *old* (by executing LL()), and then returns (*old*, True). It is not hard to see that this implementation is linearizable. (See Appendix A for the algorithm and proof of correctness.) Thus, by reduction we obtain the same lower bound as the one stated in Theorem 1 for implementations of single bit LL/SC/VL. Unfortunately, for that reduction the VL() operation is needed, and we do not know how to obtain a similarly efficient ABA-detecting register from an LL/SC object that does not support VL(). However, the proofs of Theorem 1 can be easily modified to accommodate LL/SC objects:
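The construction just described can be sketched sequentially as follows. This is not the algorithm from Appendix A, only an illustration under stated assumptions: the LL/SC/VL model breaks all links on a successful SC(), and each process is assumed to start out linked to the initial value (as if it had performed an LL() during initialization), so that a first DRead() without preceding writes returns False. Class and method names are hypothetical.

```python
class LLSCVL:
    """Sequential model of an LL/SC/VL object (illustrative)."""

    def __init__(self, value, n):
        self._value = value
        self._linked = [False] * n

    def ll(self, p):
        self._linked[p] = True
        return self._value

    def sc(self, p, x):
        if not self._linked[p]:
            return False
        self._value = x
        # Assumption: a successful SC breaks every link.
        self._linked = [False] * len(self._linked)
        return True

    def vl(self, p):
        return self._linked[p]


class ABARegisterFromLLSCVL:
    """ABA-detecting register built on one LL/SC/VL object."""

    def __init__(self, value, n):
        self._obj = LLSCVL(value, n)
        # Assumption: every process initially links to the initial value.
        self._old = [self._obj.ll(p) for p in range(n)]

    def dwrite(self, p, x):
        self._obj.ll(p)
        self._obj.sc(p, x)

    def dread(self, p):
        if self._obj.vl(p):
            return self._old[p], False
        self._old[p] = self._obj.ll(p)
        return self._old[p], True


reg = ABARegisterFromLLSCVL(0, n=3)
r1 = reg.dread(1)            # nothing written yet
reg.dwrite(0, 1)
reg.dwrite(0, 0)             # bit flipped and flipped back
r2 = reg.dread(1)            # detects the intermediate writes
r3 = reg.dread(1)            # no write since r2
r4 = reg.dread(2)            # process 2 has not read since the writes
```

Each DRead() and DWrite() performs at most two operations on the LL/SC/VL object, matching the constant step-complexity claimed in the text.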

**Corollary 1.** Any linearizable implementation of a single bit LL/SC object from m bounded base objects satisfies

- (a)  $m \ge (n-1)/t$, if the base objects are bounded CAS objects and registers, and the implementation is deterministic and wait-free with worst-case step-complexity at most t; and
- (b)  $m \ge (n-1)/(2t)$, if the base objects are bounded writable CAS objects, and the implementation is deterministic and wait-free with worst-case step-complexity at most t.

A linear space lower bound (corresponding to Part (a) of Theorem 1) for nondeterministic solo-terminating implementations of LL/SC from (even unbounded) registers follows from the fact that LL/SC objects are *perturbable* [17].

As in Theorem 1, the assumption that base objects are bounded is necessary, because there is an implementation of an LL/SC/VL object from a single unbounded CAS object with constant step complexity by Moir [26]. Our time-space tradeoff is asymptotically tight for implementations with constant step-complexity, as it matches known upper bounds [2, 15]. We show that it is also asymptotically tight for implementations using a single CAS object:

**Theorem 2.** A single bounded CAS object suffices to implement a bounded LL/SC/VL object or a bounded multi-writer ABA-detecting register with O(n) step-complexity.

These results raise the question whether bounded CAS objects are helpful for ABA detection. We determine that for this problem bounded CAS objects do not provide additional benefits over bounded registers:

**Theorem 3.** There is a linearizable wait-free implementation of a multi-writer b-bit ABA-detecting register from $n + 1$ registers of $b + 2\log n + O(1)$ bits each, with constant step complexity.

Not only do our lower bounds show that Anderson's and Moir's [2] as well as Jayanti's and Petrovic's [15] implementations of LL/SC from CAS objects and registers are optimal with respect to their time-space product, but they also clearly indicate that ABA detection is inherently difficult, even if bounded conditional read-modify-write primitives such as (writable) CAS objects are available. Therefore, other primitives that provide a solution to the ABA problem would most likely be as difficult to obtain as LL/SC. Our upper bounds demonstrate that bounded CAS objects (and in fact any conditional read-modify-write operations) are not more helpful than bounded registers with respect to ABA detection. On the other hand, ABA detection is difficult only if base objects are bounded, but for our lower bounds it does not matter how large that bound on the size of the base object is, as long as it is finite.

**Other Related Work.** Our lower bounds use covering arguments. Covering arguments were first used by Burns and Lynch [4] to prove a space lower bound for mutual exclusion, and essentially all space lower bounds are based on this technique. Examples are space lower bounds for one-time test-and-set objects [30], consensus [8], timestamps [6,9], and the general class of perturbable objects [17] (which includes LL/SC among others). These lower bounds have in common that they do not apply if CAS objects are available as base objects. (They allow for registers, swap objects, and, in case of [17], resettable consensus.) An overview of covering arguments can be found in Attiya's and Ellen's recent textbook [3].

In our time-space tradeoffs we construct executions in which a sequence of operations by a process p is interleaved with successful CAS() and Write() operations of other processes, so that p's steps remain "hidden". Such a technique has also been used by Fich, Hendler, and Shavit [7] to prove linear space lower bounds for wait-free implementations of *visible* objects implemented from conditional read-modify-write (i.e., writable CAS) objects. Visible objects include counters, queues, stacks, or snapshots. Neither ABA-detecting registers nor LL/SC objects are visible, because they can be implemented from a single unbounded CAS object. In fact, we are not aware of any other non-trivial lower bounds that, like ours, separate bounded from unbounded base objects.

**Preliminaries.** We consider a system with *n* processes with unique IDs in  $\mathcal{P} = \{0, \ldots, n-1\}$ . Processes communicate through shared memory operations, called steps, that are executed on atomic base objects provided by the system. Each process executes a possibly nondeterministic program. If processes are deterministic, a *schedule* is a sequence of process IDs, that determines the order in which processes execute their steps. If processes are nondeterministic, a *schedule* is a sequence of process IDs together with coin-flips, and it describes the order in which processes take steps together with the nondeterministic decisions they make. The sequence of shared memory steps taken by processes is called *execution*. A *history* on some implemented object is the sequence of method call invocations and responses that occur in an execution on that object. A *configuration* describes the state of the system, i.e., of all processes and all base objects.

Our implementations are deterministic and *wait-free*, which means that every method call terminates within a finite number of the calling process' steps, in any execution. The step-complexity of a deterministic wait-free method is the maximum number of steps a process needs to terminate the method call in any execution. Our lower bounds hold for implementations that satisfy a progress condition which is strictly weaker than wait-freedom: A nondeterministic method m satisfies *nondeterministic solo-termination*, if for every process p and every configuration C in which a call of method m by p is pending, there is a p-only execution that starts in C and during which p finishes method m. For deterministic algorithms, nondeterministic solo-termination is the same as *obstruction-freedom*. Our algorithms are *linearizable* [13], but our lower bounds work for much weaker correctness conditions.

## 2. Lower Bounds

For a configuration C and a schedule  $\sigma$ , let  $\text{Exec}(C, \sigma)$  denote the execution arising from processes taking steps, starting in configuration C, in the order defined by  $\sigma$ , and using the nondeterministic decisions defined by  $\sigma$ , if the algorithm is nondeterministic. Let  $\text{Conf}(C, \sigma)$  denote the configuration resulting from execution  $\text{Exec}(C, \sigma)$  started in C. For two configurations C and D and a schedule  $\alpha$ , we write  $C \stackrel{\alpha}{\rightsquigarrow} D$  to indicate that  $\text{Conf}(C, \alpha) = D$ . Let  $C_{init}$  denote the initial configuration. If there exists a schedule  $\alpha$  such that  $C \stackrel{\alpha}{\rightsquigarrow} D$ , then we say D is reachable from C, and if D is reachable from  $C_{init}$ , we simply say D is reachable.

An execution E or a schedule  $\alpha$  is P-only for a set  $P \subseteq \{0, \ldots, n-1\}$  of processes, if only processes in P take steps during E or  $\alpha$, respectively. If  $P = \{p\}$  is the set of a single process, then we sometimes write p-only instead of  $\{p\}$-only.

For an execution E, let  $\prec_E$  denote the happens-before order on operations in E, i.e., if operation op responds in E before op' gets invoked, then and only then  $op \prec_E op'$  (op happens before op'). We write simply  $\prec$  instead of  $\prec_E$, if it is clear from the context which execution E the relation refers to. For a schedule  $\alpha$, an execution E and a process p, E|p and  $\alpha|p$  denote the sub-sequences of steps by p in E and in  $\alpha$, respectively.

Two configurations C and D are *indistinguishable* to process p, if every register has the same value in C as in D, and p is in the same state in both configurations. We write  $C \sim_p D$  to denote that C and D are indistinguishable to p. We write  $C \sim_S D$  for a set S of processes to denote that  $C \sim_p D$  for every process  $p \in S$ . We say process p is *idle* in configuration C, if it has no pending method call, and if all processes are idle, then the configuration is *quiescent*. A process *completes* a method call in an execution E, if that method terminates in E.

For our lower bounds, we do not require that the implementation of the ABA-detecting registers is linearizable. Instead, we consider methods WeakRead() and WeakWrite() that take no arguments, and where WeakRead() returns a Boolean value, and WeakWrite() returns nothing. A correct concurrent implementation of these methods must guarantee for every execution, that a WeakRead() operation r by process p returns True if and only if there exists a WeakWrite() operation w such that w happens before r and every other WeakRead() operation by p happens before w.

Linearizability of an ABA-detecting register R guarantees that the operations R.DRead() (in place of WeakRead()) and R.DWrite() (in place of WeakWrite()) satisfy the correctness properties above. Therefore, every lower bound on the time and/or space complexity for correct implementations of those methods implies the same lower bound for linearizable ABA-detecting registers.

Let p be some process and C a configuration. We say C is p-clean, if there exists a schedule  $\alpha$ ,  $C_{init} \stackrel{\alpha}{\leadsto} C$ , such that  $\text{Exec}(C_{init}, \alpha)$  contains a complete WeakRead() operation  $r^*$  by p, and every WeakWrite() happens before  $r^*$ . Configuration C is p-dirty, if there exists a schedule  $\alpha$ ,  $C_{init} \stackrel{\alpha}{\leadsto} C$ , and  $\text{Exec}(C_{init}, \alpha)$  contains a complete WeakWrite() operation  $w^*$  such that no WeakRead() by p is pending at any point after  $w^*$  has been invoked. Note that some configurations are neither p-dirty nor p-clean.

Throughout this section we assume that each process executes an infinite program, in which it repeatedly calls WeakRead() and WeakWrite() methods. More specifically, process 0 repeatedly executes WeakWrite(), while every process in  $\{1, ..., n-1\}$  repeatedly calls WeakRead().

Then in a *p*-only execution starting from a configuration C, the first WeakRead() operation by p returns False if C is *p*-clean and True if C is *p*-dirty. Therefore, each process must be able to distinguish *p*-clean configurations from *p*-dirty ones. The full proof of the following observation can be found in Appendix B.1.

**Observation 1.** Suppose the WeakRead() method satisfies nondeterministic solo-termination. For any process  $p \in \{1, ..., n-1\}$  and any two reachable configurations  $C_1, C_2$, if  $C_1$  is p-clean and  $C_2$  is p-dirty, then  $C_1 \not\sim_p C_2$.

#### 2.1. A Space Lower Bound for Implementations from Bounded Registers

Let  $\mathcal{R}$  be a set of k registers and P a set of processes. We say the processes in P cover  $\mathcal{R}$  in configuration C, if for each register  $R \in \mathcal{R}$  there is a process in P that is poised to write to R. A block-write to  $\mathcal{R}$  is an execution in which k processes participate, and each of them takes exactly one step in which it writes to a distinct register in  $\mathcal{R}$ . (The only block-write to  $\emptyset$  is the empty execution.)
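The reason covering is useful in the lower bound argument is that a block-write erases any trace of earlier writes to the covered registers. The following toy model illustrates this; the register names, process names, and values are hypothetical, and only register contents (not process states) are modeled.

```python
def block_write(registers, pending):
    """Each covering process takes exactly one step: its pending write."""
    for reg, value in pending.values():
        registers[reg] = value


start = {"R1": 0, "R2": 0, "R3": 0}

# p1 is poised to write 7 to R1 and p2 is poised to write 9 to R2,
# so {p1, p2} cover the set {R1, R2}.
pending = {"p1": ("R1", 7), "p2": ("R2", 9)}

# Run 1: the block-write happens immediately.
run1 = dict(start)
block_write(run1, pending)

# Run 2: another process q first writes, but only to covered registers,
# and then the same block-write happens.
run2 = dict(start)
run2["R1"] = 42
run2["R2"] = 43
block_write(run2, pending)

same_register_configuration = (run1 == run2)
```

After the block-write the two runs have identical register contents, so no process other than q can tell whether q took its steps.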

In the following we assume an implementation of methods WeakRead() and WeakWrite() from m bounded registers. The register configuration of a configuration C is an m-tuple, reg $(C) = (v_1, \ldots, v_m)$ , where  $v_i$  is the value of the *i*-th register.

**Lemma 1.** Suppose methods WeakRead() and WeakWrite() satisfy nondeterministic solo-termination. For any quiescent configuration Q and any set  $P_k = \{p_1, \ldots, p_k\} \subseteq \mathcal{P} \setminus \{0\}$, where  $k \in \{0, \ldots, n-1\}$, there exists a  $(P_k \cup \{0\})$-only schedule  $\alpha$  such that in  $\text{Conf}(Q, \alpha)$  process 0 is idle and k distinct registers are covered by  $p_1, \ldots, p_k$.

Lemma 1 immediately implies Theorem 1 (a).



Figure 1: Proof of Lemma 1. Let  $P_k^0$  denote the set  $P_k \cup \{0\}$ . Double circles denote quiescent configurations.

*Proof of Lemma 1.* The proof is by induction on k. If k = 0, we let  $\alpha$  be the empty schedule, and the lemma is immediate because  $\text{Conf}(Q, \alpha) = Q$  is a quiescent configuration (so 0 is idle).

Now suppose we have proved the inductive hypothesis for some integer k < n - 1. Let  $\beta = (p_1, \ldots, p_k)$  be the schedule in which each of  $p_1, \ldots, p_k$  takes exactly one step. Let  $Q_0 = Q$ . By the inductive hypothesis there is a schedule  $\alpha_1$  such that in  $C_1 := \operatorname{Conf}(Q_0, \alpha_1)$  a set  $\mathcal{R}_1$  of exactly k registers is covered, and process 0 is idle. Hence,  $\operatorname{Exec}(C_1, \beta)$  is a block-write to  $\mathcal{R}_1$  yielding a configuration  $D_1 = \operatorname{Conf}(C_1, \beta)$ . We let  $\gamma_1$  be the schedule such that in  $\operatorname{Exec}(D_1, \gamma_1)$  first each process in  $\{p_1, \ldots, p_k\}$  takes enough unobstructed steps to finish its WeakRead() method call, and after that process 0 takes enough unobstructed steps to complete exactly one WeakWrite() method. Then  $Q_1 = \operatorname{Conf}(D_1, \gamma_1)$  is quiescent, and during  $\operatorname{Exec}(D_1, \gamma_1)$  exactly one complete WeakWrite() gets executed. Repeating this construction (using the inductive hypothesis repeatedly) we obtain a schedule  $\alpha_1\beta\gamma_1\alpha_2\beta\gamma_2\alpha_3\ldots$  and configurations  $Q_0, C_1, D_1, Q_1, C_2, D_2, Q_2, \ldots$  and sets of k registers  $\mathcal{R}_1, \mathcal{R}_2, \ldots$ , such that for any  $i \geq 1$ :

- $Q_{i-1} \stackrel{\alpha_i}{\leadsto} C_i \stackrel{\beta}{\leadsto} D_i \stackrel{\gamma_i}{\leadsto} Q_i;$
- $Q_i$  is quiescent;
- during  $\text{Exec}(D_i, \gamma_i)$  process 0 executes a complete WeakWrite() operation; and

- in  $C_i$  process 0 is idle and  $\mathcal{R}_i$  is covered by  $P_k$  (and thus  $\operatorname{Exec}(C_i, \beta)$  is a block-write to  $\mathcal{R}_i$).

Since the number of registers is finite, and all registers are bounded, there exist indices  $1 \leq i < j$  such that  $\operatorname{reg}(D_i) = \operatorname{reg}(D_j)$. Let  $\sigma = \gamma_i \alpha_{i+1} \beta \gamma_{i+1} \alpha_{i+2} \dots \alpha_j \beta$, i.e.,

$$C_i \stackrel{\beta}{\rightsquigarrow} D_i \stackrel{\sigma}{\rightsquigarrow} D_j. \tag{1}$$

This situation is depicted in Figure 1. Now let  $\lambda'$  be a  $p_{k+1}$-only schedule such that in  $\text{Exec}(C_i, \lambda')$  process  $p_{k+1}$  completes exactly one WeakRead() method call. By the nondeterministic solo-termination property, such a schedule  $\lambda'$  exists. Let  $\lambda$  be the prefix of  $\lambda'$, such that  $\text{Exec}(C_i, \lambda)$  ends when  $p_{k+1}$  is poised to write to a register  $R \notin \mathcal{R}_i$  for the first time, or  $\lambda = \lambda'$  if  $p_{k+1}$  finishes its WeakRead() method call without writing to a register outside of  $\mathcal{R}_i$.

First assume  $\lambda \neq \lambda'$, i.e., in  $\text{Exec}(C_i, \lambda)$  process  $p_{k+1}$  does not finish its WeakRead() method call, but instead the execution ends when  $p_{k+1}$  covers a register  $R \notin \mathcal{R}_i$. Since in  $C_i$  process 0 is idle and  $\mathcal{R}_i$  is covered by  $P_k$, and since  $\lambda$  is  $p_{k+1}$-only, in configuration  $\operatorname{Conf}(C_i, \lambda) = \operatorname{Conf}(Q, \alpha_1 \beta \gamma_1 \dots \alpha_i \lambda)$  processes  $p_1, \dots, p_{k+1}$  cover k + 1 registers, and process 0 is still idle. This completes the proof of the inductive step for  $\alpha = \alpha_1 \beta \gamma_1 \dots \alpha_i \lambda$.

Now we consider the case  $\lambda = \lambda'$ , i.e., during  $\text{Exec}(C_i, \lambda)$  process  $p_{k+1}$  finishes its WeakRead() method call without writing to a register outside of  $\mathcal{R}_i$ . To complete the proof of the lemma, it suffices to show that this case cannot occur. (This case is depicted in Figure 1.)

Since in  $C_i$  the processes in  $P_k$  cover  $\mathcal{R}_i$ , and  $p_{k+1}$  only writes to registers in  $\mathcal{R}_i$  during  $\text{Exec}(C_i, \lambda)$ , it follows that  $\text{Exec}(C_i, \lambda\beta)$  ends with a block-write by  $P_k$  in which all writes by  $p_{k+1}$  get obliterated. In particular, for  $D'_i := \text{Conf}(C_i, \lambda\beta)$  we have

$$D_i' \underset{\mathcal{P} \setminus \{p_{k+1}\}}{\sim} D_i \tag{2}$$

Hence, since schedule  $\sigma$  is  $(P_k \cup \{0\})$ -only, i.e.,  $p_{k+1}$  does not participate, we obtain  $\text{Exec}(D'_i, \sigma) = \text{Exec}(D_i, \sigma)$ , and in particular using Eq. (1)

$$C_i \stackrel{\lambda\beta}{\leadsto} D'_i \stackrel{\sigma}{\leadsto} D'_j, \quad \text{where} \quad D_j \underset{\mathcal{P} \setminus \{p_{k+1}\}}{\sim} D'_j. \tag{3}$$

Now recall that we chose *i* and *j* in such a way that  $\operatorname{reg}(D_i) = \operatorname{reg}(D_j)$ . Thus, from Eq. (2) and (3) we get  $\operatorname{reg}(D'_i) = \operatorname{reg}(D_i) = \operatorname{reg}(D_j) = \operatorname{reg}(D'_j)$ . Because  $D'_i \stackrel{\sigma}{\rightsquigarrow} D'_j$  (Eq. (3)), and since by construction only processes  $\{0, p_1, \ldots, p_k\}$  appear in  $\sigma$ ,  $p_{k+1}$  is in  $D'_i$  in exactly the same state as in  $D'_j$ . Hence,

$$D'_i \underset{p_{k+1}}{\sim} D'_j. \tag{4}$$

Now recall that  $C_i \stackrel{\lambda\beta}{\leadsto} D'_i$ , and in the corresponding execution process  $p_{k+1}$  executes a complete WeakRead() method, while process 0 takes no steps, and  $p_{k+1}$  is idle in  $D'_i$ . Hence,  $D'_i$  is  $p_{k+1}$ -clean. On the other hand,  $\operatorname{Exec}(D'_i, \sigma) = \operatorname{Exec}(D_i, \sigma)$  starts with a complete WeakWrite() operation (during the prefix  $\operatorname{Exec}(D'_i, \gamma_i)$ ) by process 0, while process  $p_{k+1}$  takes no steps, and thus remains idle. It follows that the configuration resulting from that execution,  $D'_j$ , is  $p_{k+1}$ -dirty. Summarizing, we have two reachable configurations,  $D'_i$  and  $D'_j$ , where one of them is  $p_{k+1}$ -clean and the other one is  $p_{k+1}$ -dirty, and both are indistinguishable to  $p_{k+1}$ , according to Eq. (4). This contradicts Observation 1.

#### 2.2. A Time-Space Tradeoff for Implementations from CAS Objects

We now consider deterministic wait-free implementations of WeakRead() and WeakWrite() from m writable bounded CAS objects. We assume without loss of generality that every CAS(x,y) operation satisfies  $x \neq y$. (A CAS(x,x) operation can be replaced by a Read().)

For any configuration C and any shared CAS object R let CCov(C, R) and WCov(C, R) denote the sets of processes that are poised in C to execute a CAS() operation and a Write() operation on R, respectively. Let P be a set of processes and C a configuration. A schedule $\beta$ is called P-successful for C if it contains every process in P exactly once, and every step of $Exec(C, \beta)$ is either a Write() or a successful CAS(). If a configuration C has a P-successful schedule $\beta$, then we also say that execution $Exec(C, \beta)$ is P-successful.
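To make the definition concrete, the following Python sketch checks whether a schedule is P-successful for a configuration. The encoding is our own assumption for illustration only: a configuration is reduced to the current object values plus, for each process, the step it is poised to execute, and the names `Step` and `is_successful` are hypothetical. Because every process appears exactly once in the schedule, the step it takes is exactly its poised step.

```python
from dataclasses import dataclass
from typing import Dict, Hashable, Optional, Sequence, Set


@dataclass(frozen=True)
class Step:
    """The shared-memory step a process is poised to execute (assumed encoding)."""
    kind: str                            # "read", "write", or "cas"
    obj: str                             # name of the base object
    new: Optional[Hashable] = None       # value to store (write / cas)
    expected: Optional[Hashable] = None  # comparison value (cas only)


def is_successful(values: Dict[str, Hashable],
                  poised: Dict[int, Step],
                  schedule: Sequence[int],
                  procs: Set[int]) -> bool:
    """Return True iff `schedule` is P-successful for the configuration
    given by `values` (object states) and `poised` (next steps), P = procs."""
    # The schedule must contain every process in P exactly once.
    if len(schedule) != len(procs) or set(schedule) != set(procs):
        return False
    vals = dict(values)                  # simulate on a copy
    for p in schedule:
        step = poised[p]
        if step.kind == "write":
            vals[step.obj] = step.new
        elif step.kind == "cas" and vals[step.obj] == step.expected:
            vals[step.obj] = step.new    # successful CAS
        else:
            return False                 # a Read() or a failed CAS()
    return True
```

For example, with one object of value 0 and three processes poised to execute CAS(0,1), CAS(1,2), and Write(7), the schedule in that order is P-successful, whereas any schedule that starts with the CAS(1,2) is not.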

As before, we assume that all processes run an infinite loop, where process 0 repeatedly calls WeakWrite() while all other processes repeatedly call WeakRead().

**Lemma 2.** Let  $P \subsetneq \mathcal{P} \setminus \{0\}$ ,  $q \in \mathcal{P} \setminus P$ ,  $q \neq 0$ . Let C be a configuration, in which either q is idle, or in no execution starting from C process q executes more than t shared memory steps before finishing a pending WeakRead() call. If  $\beta$  is a P-successful schedule for C, then there is a schedule  $\sigma$  such that

$$\operatorname{Conf}(C,\beta) \underset{\mathcal{P} \setminus \{q\}}{\sim} \operatorname{Conf}(C,\sigma), \tag{5}$$

and at least one of the following is the case:

- (a) In  $\operatorname{Conf}(C, \sigma)$  process q is idle;
- (b) in  $\operatorname{Conf}(C, \sigma)$  process q is poised to write to some object R and  $|\operatorname{WCov}(C, R) \cap P| < t$ ; or
- (c) in  $\operatorname{Conf}(C, \sigma)$  process q is poised to execute a  $\operatorname{CAS}(x,y)$  operation on some object R, where x is the value of R in configuration  $\operatorname{Conf}(C, \sigma)$ , and  $|\operatorname{WCov}(C, R) \cap P| + |\operatorname{CCov}(C, R) \cap P| < t$ .

*Proof.* We prove the lemma by induction on t. If t = 0, then q is idle in C. Hence, for  $\sigma = \beta$  we obtain Eq. (5) and Case (a).

Let  $op_q$  be the step q is poised to execute in C, and let V be the object affected by  $op_q$ . Further, let  $val_C(V)$  denote the value of V in configuration C.

**Case 1:** First, assume that  $op_q$  is a Read() or a CAS() operation. Let z be the first process in P that executes a step in  $\text{Exec}(C,\beta)|V$ , and let  $op_z$  be that step. We construct a two-step schedule  $\lambda$  that contains q and z, such that

$$C' := \operatorname{Conf}(C, \lambda) \underset{\mathcal{P} \setminus \{q\}}{\sim} \operatorname{Conf}(C, z). \tag{6}$$

First suppose $op_z$ is a CAS(a, b) operation and $op_q$ a CAS(x, y) operation that would succeed in C (i.e., $x = val_C(V)$). Then we define $\lambda = (z, q)$. Since $\beta$ is *P*-successful, the CAS(a, b) by z in configuration C succeeds and changes the value of V from a to b. In this case, $x = a = val_C(V)$, and $b \neq a$ by our assumption on CAS operations, so in the execution $\text{Exec}(C, \lambda)$ the CAS(x, y) by q finds value b in V and fails. Eq. (6) follows.

In all other cases (i.e., if $op_z$ is a Write() operation, or $op_q$ is a Read() or a CAS() that fails in C), we let $\lambda = (q, z)$. Then in Exec$(C, \lambda)$ either operation $op_q$ does not change the value of object V, or $op_z$ executes a Write() and overwrites any changes that may have resulted from $op_q$. It follows that Eq. (6) is true.
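The case distinction for $\lambda$ can be checked mechanically on a single object. The sketch below is our own illustration, with hypothetical helper names and steps encoded as tuples `("read",)`, `("write", v)`, or `("cas", x, y)`. It picks the order of q and z as in the two paragraphs above and confirms that, afterwards, the value of V and the response received by z are the same as if z alone had taken its step.

```python
def apply(val, op):
    """Apply one step to an object with value `val`; return (new value, response)."""
    if op[0] == "read":
        return val, val
    if op[0] == "write":
        return op[1], None
    _, expected, new = op                  # ("cas", expected, new)
    return (new, True) if val == expected else (val, False)


def two_step_schedule(val, op_q, op_z):
    """Order of q and z in lambda, following the case distinction in the proof."""
    q_cas_succeeds = op_q[0] == "cas" and op_q[1] == val
    return ("z", "q") if op_z[0] == "cas" and q_cas_succeeds else ("q", "z")


def hidden_from_others(val, op_q, op_z):
    """Check that V's value and z's response are as if only z had taken a step."""
    ops = {"q": op_q, "z": op_z}
    cur, resp_z = val, None
    for name in two_step_schedule(val, op_q, op_z):
        cur, resp = apply(cur, ops[name])
        if name == "z":
            resp_z = resp
    return (cur, resp_z) == apply(val, op_z)
```

The check assumes the preconditions of Case 1: $op_q$ is a Read() or a CAS(x,y) with $x \neq y$, and $op_z$ is a Write() or a CAS() that succeeds in C.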

Now let $\beta' = \beta | (P \setminus \{z\})$, and recall that $C' = \text{Conf}(C, \lambda)$. Since $\beta$ is *P*-successful and in $\text{Exec}(C, \beta)$ process *z* executes the first step on *V*, it follows that $\beta'$ is $(P \setminus \{z\})$-successful in *C'*.

Hence, we can apply the inductive hypothesis for configuration C', process set $P' = P \setminus \{z\}$, and schedule $\beta'$, to obtain a schedule $\sigma'$ that satisfies one of the Cases (a)-(c). Let $\sigma = \lambda \circ \sigma'$. Then by construction, $\operatorname{Conf}(C, \sigma) = \operatorname{Conf}(C', \sigma') \sim_{P \setminus \{q,z\}} \operatorname{Conf}(C', \beta') = \operatorname{Conf}(C, \beta)$. Because of Eq. (6), process z also cannot distinguish between $\operatorname{Conf}(C, \sigma)$ and $\operatorname{Conf}(C, \beta)$, so we obtain Eq. (5). If Case (a) of the inductive hypothesis applies for C' and $\sigma'$, then the same also applies for C and $\sigma$, because $\operatorname{Conf}(C, \sigma) = \operatorname{Conf}(C', \sigma')$. Now suppose that Case (b) applies for C' and $\sigma'$. Let R be the object on which process q is poised to execute a Write() in $\operatorname{Conf}(C', \sigma') = \operatorname{Conf}(C, \sigma)$. Starting from configuration C', process q must finish its WeakRead() method within t' = t - 1 steps. Hence, by Case (b) of the inductive hypothesis, $|\operatorname{WCov}(C', R) \cap P'| < t'$. Since all processes other than z are poised to execute the same step in C as in C', we have $|\operatorname{WCov}(C, R) \cap P| \leq |\operatorname{WCov}(C', R) \cap P'| + 1 < t' + 1 = t$. Hence, Case (b) follows for C, $\sigma$, and P. With exactly the same argument, if Case (c) applies for C', $\sigma'$, and P', then it also applies for C, $\sigma$, and P.

**Case 2:** We now assume that in C process q is poised to execute a Write() operation  $op_q$  on object V. If  $|WCov(C, V) \cap P| < t$ , we let  $\sigma = \beta$ . Then Eq. (5) and Case (b) (for R = V) of the lemma are trivially satisfied.

Hence, assume that  $|WCov(C, V) \cap P| \ge t$ . Then  $Exec(C, \beta)$  contains at least t writes to V. Let  $z_1, \ldots, z_{\ell-1}$  be the processes accessing V in this order in  $Exec(C, \beta)|V$  before the first write to V occurs, and let  $z_{\ell}$  be the first process writing to V. Let  $Z = \{z_1, \ldots, z_{\ell}\}, \lambda = (z_1, \ldots, z_{\ell-1}, q, z_{\ell}), \gamma = \beta |Z = (z_1, \ldots, z_{\ell}), \beta' = \beta |(P \setminus Z) \text{ and } P' = P \setminus Z$ . Then in  $Exec(C, \gamma \circ \beta')$  all processes in P execute exactly one step, as they do in  $Exec(C, \beta)$ , and for each object U the steps executed on U occur in the same order in both executions. Hence, processes cannot distinguish these executions from each other, and in particular

$$\operatorname{Conf}(C, \gamma \circ \beta') = \operatorname{Conf}(C, \beta). \tag{7}$$

In Exec $(C, \lambda)$ , first processes  $z_1, \ldots, z_{\ell-1}$  execute successful CAS operations on V, then q writes to V, and finally,  $z_{\ell}$  overwrites what q has written. It follows that

$$C' := \operatorname{Conf}(C, \lambda) \underset{\mathcal{P} \setminus \{q\}}{\sim} \operatorname{Conf}(C, \gamma). \tag{8}$$

Combining this with Eq. (7) we obtain that  $\beta'$  is P'-successful in C'. Moreover, since q executed one step in the execution leading from C to C', in any execution starting from C' it finishes its WeakRead() method after at most t' = t - 1 steps. Thus, we can apply the inductive hypothesis to obtain a schedule  $\sigma'$  such that  $\operatorname{Conf}(C', \beta') \sim_{\mathcal{P} \setminus \{q\}} \operatorname{Conf}(C', \sigma')$ , and one of Cases (a)-(c) holds. Let  $\sigma = \lambda \circ \sigma'$ . Then  $\operatorname{Conf}(C, \sigma) = \operatorname{Conf}(C', \sigma') \sim_{\mathcal{P} \setminus \{q\}} \operatorname{Conf}(C', \beta') \sim_{\mathcal{P} \setminus \{q\}} \operatorname{Conf}(C, \beta)$ , where the last relation follows from Eq. (7) and (8). This proves Eq. (5).

If Case (a) of the inductive hypothesis holds for C' and $\sigma'$, then it is also true for C and $\sigma$ because $\operatorname{Conf}(C, \sigma) = \operatorname{Conf}(C', \sigma')$. Now suppose either Case (b) or (c) applies to C' and $\sigma'$. Let R be the object process q is poised to access in $\operatorname{Conf}(C, \sigma) = \operatorname{Conf}(C', \sigma')$. If $R \neq V$, then by construction we have $|\operatorname{WCov}(C, R) \cap P| = |\operatorname{WCov}(C', R) \cap P'|$ and $|\operatorname{CCov}(C, R) \cap P| = |\operatorname{CCov}(C', R) \cap P'|$. Hence, Case (b) or (c) for C', $\sigma'$, and P' immediately implies the same case for C, $\sigma$, and P. Finally, suppose Case (b) or (c) occurs for R = V. By construction in $\operatorname{Exec}(C, \lambda)$ only one process among all processes in $\operatorname{WCov}(C, V) \cap P$ writes to V, namely process $z_{\ell}$. Hence, in the configuration C' reached by $\operatorname{Exec}(C, \lambda)$ all other processes in $\operatorname{WCov}(C, V) \cap P$ are still poised to write to V. Thus $|\operatorname{WCov}(C', R) \cap P'| \geq |\operatorname{WCov}(C, R) \cap P| - 1 \geq t - 1 = t'$, so neither Case (b) nor Case (c) can apply to C', $\sigma'$, and P', which is a contradiction.



Figure 2: Proof of Lemma 3. Double circles denote quiescent configurations.

**Lemma 3.** Suppose WeakRead() and WeakWrite() have step complexity at most t. For any reachable quiescent configuration Q and any set  $P_k = \{p_1, \ldots, p_k\} \subseteq \mathcal{P} \setminus \{0\}$ , where  $k \in \{0, \ldots, n-1\}$ , there exists a  $(P_k \cup \{0\})$ -only schedule  $\alpha$  such that  $C := \text{Conf}(Q, \alpha)$  satisfies all of the following:

- (i) all processes in  $\mathcal{P} \setminus P_k$  are idle in C;
- (ii) there is a $P_k$-successful schedule for C; and
- (iii) $|\operatorname{WCov}(C, R) \cap P_k| \leq t$ and $|\operatorname{CCov}(C, R) \cap P_k| \leq t$ for all objects R.

*Proof.* Throughout this proof let  $P_k^0$  denote the set  $P_k \cup \{0\}$ . We prove the lemma by induction on k. For k = 0, we let  $\alpha$  be the empty schedule, so  $C = \text{Conf}(Q, \alpha) = Q$ . Then C is quiescent and (i) is true. Statements (ii)–(iii) follow immediately from  $P_k = \emptyset$ .

Now suppose the inductive hypothesis is true for some value of $k \in \{0, \ldots, n-2\}$. We let $Q_0 = Q$ and $P_{k+1} = P_k \cup \{p_{k+1}\}$ for an arbitrary process $p_{k+1} \in \mathcal{P} \setminus P_k^0$. Then, for $i = 1, 2, \ldots$ we iteratively construct schedules $\alpha_i, \beta_i, \gamma_i$ and configurations $Q_i, C_i, D_i$, where $Q_{i-1} \stackrel{\alpha_i}{\leadsto} C_i \stackrel{\beta_i}{\leadsto} D_i \stackrel{\gamma_i}{\leadsto} Q_i$, and $\alpha_i, \beta_i, \gamma_i$ are determined as follows: $\alpha_i$ is a $P_k^0$-only schedule that guarantees properties (i)-(iii) from the inductive hypothesis for configuration $Q_{i-1}$; $\beta_i$ is a $P_k$-successful schedule for $C_i$; and $\gamma_i$ is an arbitrary $P_k$-only schedule followed by a 0-only schedule such that $Q_i$ is quiescent, and where $\text{Exec}(D_i, \gamma_i)$ contains exactly one complete WeakWrite() operation by process 0. By the assumption that WeakRead() and WeakWrite() are wait-free, $\gamma_i$ exists.

We define for each configuration  $C_i$  a signature,  $sig(C_i)$ , which encodes for every process p the exact shared memory operation p is poised to execute next (including its parameters), and for every base object R its value.

Since there is only a finite number of bounded base objects in the system, there is a finite number of signatures, and thus there exist $1 \leq i < j$ such that $C_i$ and $C_j$ have the same signature. We let $\lambda = \alpha_1 \beta_1 \gamma_1 \alpha_2 \dots \alpha_{i-1} \beta_{i-1} \gamma_{i-1}$, and $\lambda' = \gamma_i \alpha_{i+1} \beta_{i+1} \gamma_{i+1} \alpha_{i+2} \dots \alpha_{j-1} \beta_{j-1} \gamma_{j-1}$. From the construction above we have $Q \stackrel{\lambda}{\leadsto} Q_{i-1} \stackrel{\alpha_i}{\leadsto} C_i \stackrel{\beta_i}{\leadsto} D_i \stackrel{\lambda'}{\leadsto} Q_{j-1} \stackrel{\alpha_j}{\leadsto} C_j$, where $Q_{i-1}$ and $Q_{j-1}$ are quiescent, $C_i$ satisfies (i)-(iii) from the inductive hypothesis, and $sig(C_i) = sig(C_j)$. This situation, as well as the following construction, is depicted in Figure 2.

Now we apply Lemma 2 to configuration  $C_i$ . For the purpose of applying this lemma, we may assume that in  $C_i$  process  $p_{k+1}$  has just invoked a WeakRead() operation but not yet executed its first shared memory step during that operation. Hence, in all executions starting from  $C_i$ ,  $p_{k+1}$  will finish that pending WeakRead() operation in at most t steps. Then Lemma 2 yields a  $P_{k+1}$ -only schedule  $\sigma_i$  such that for  $D'_i = \text{Conf}(C_i, \sigma_i)$ 

$$D_i \underset{\mathcal{P} \setminus \{p_{k+1}\}}{\sim} D'_i, \tag{9}$$

and one of the Cases (a)-(c) of Lemma 2 holds. Let $C'_j = \operatorname{Conf}(D'_i, \lambda'\alpha_j)$, and $D'_j = \operatorname{Conf}(C'_j, \beta_i)$. (The use of $\beta_i$ instead of $\beta_j$ is intentional.) Then $Q_{i-1} \stackrel{\alpha_i}{\leadsto} C_i \stackrel{\sigma_i}{\leadsto} D'_i \stackrel{\lambda'\alpha_j}{\leadsto} C'_j \stackrel{\beta_i}{\leadsto} D'_j$. Since $\lambda'\alpha_j$ does not contain $p_{k+1}$, which is the only process that, according to Eq. (9), may be able to distinguish $D_i$ from $D'_i$, we obtain

$$C'_{j} \underset{\mathcal{P} \setminus \{p_{k+1}\}}{\sim} C_{j}. \tag{10}$$

Configurations  $C_i$  and  $C_j$  have the same signature. Therefore, every process is poised to execute the same step in  $C_i$  as in  $C_j$ , and all objects have the same values in both configurations. This, together with the fact that  $\beta_i$  is  $P_k$ -only and every process appears at most once in  $\beta_i$  implies

$$\operatorname{Exec}(C_i,\beta_i) = \operatorname{Exec}(C_j,\beta_i) \stackrel{(10)}{=} \operatorname{Exec}(C'_j,\beta_i). \tag{11}$$

Hence, all objects have the same value in  $D_i = \operatorname{Conf}(C_i, \beta_i)$  as in  $D'_j = \operatorname{Conf}(C'_j, \beta_i)$ , and thus from Eq. (9), all objects have the same value in  $D'_i$  as in  $D'_j$ . Since  $p_{k+1}$  does not appear in  $\lambda' \alpha_j \beta_i$ , and thus takes no step in the execution leading from  $D'_i$  to  $D'_j$ , we conclude

$$D'_i \underset{p_{k+1}}{\sim} D'_j. \tag{12}$$

Now recall that $\sigma_i$ is the schedule $\sigma$ guaranteed by Lemma 2 (applied with $C = C_i$ and $q = p_{k+1}$), and the lemma guarantees one of the three Cases (a)-(c). First, assume Case (a) occurs, i.e., $p_{k+1}$ completes a WeakRead() method call in $\text{Exec}(C_i, \sigma_i)$ (recall that in $C_i$ it had just invoked that method call) and is idle in $D'_i = \text{Conf}(C_i, \sigma_i)$. Since process 0 takes no steps in $\text{Exec}(C_i, \sigma_i)$ it follows that $D'_i$ is $p_{k+1}$-clean. On the other hand, $\text{Exec}(D'_i, \lambda'\alpha_j\beta_i)$ contains no steps by $p_{k+1}$, but instead a complete WeakWrite() by process 0. Hence, $D'_j = \text{Conf}(D'_i, \lambda'\alpha_j\beta_i)$ is $p_{k+1}$-dirty. But this contradicts Observation 1, because according to Eq. (12) process $p_{k+1}$ cannot distinguish $D'_i$ from $D'_j$. Hence, we know that Case (a) from Lemma 2 cannot apply.

Now, suppose that instead Case (b) or (c) applies. We show that statements (i)–(iii) of the lemma are true for  $\alpha = \lambda \alpha_i \sigma_i \lambda' \alpha_j$  and  $C = \text{Conf}(Q, \alpha) = C'_j$ . By the inductive hypothesis (i), in  $C_j$  all processes in  $\mathcal{P} \setminus P_k$  are idle, so from Eq. (10) it follows that in  $C'_j$  all processes in  $\mathcal{P} \setminus P_{k+1}$  are idle. This proves (i).

According to Cases (b) and (c) of Lemma 2, in configuration $D'_i$ (and thus also in $C'_j$ and $D'_j$) process $p_{k+1}$ is poised to execute an operation op that is either a Write() or a CAS(x, y) on some object $R^*$. Moreover, in case that op is a CAS(x, y), in configuration $D'_i$ object $R^*$ has value x. Then Eq. (12) implies that the value of $R^*$ is also x in configuration $D'_j$, and in particular, if $p_{k+1}$ executes CAS(x,y) in that configuration, that CAS() succeeds. We conclude that in the execution $Exec(C'_j, \beta_i \circ p_{k+1})$ process $p_{k+1}$ takes exactly one step, which is either a Write() or a successful CAS(x,y). By construction, $Exec(C_i, \beta_i)$ is $P_k$-successful, and so Eq. (11) implies that $Exec(C'_j, \beta_i)$ is also $P_k$-successful. It follows that $Exec(C'_j, \beta_i \circ p_{k+1})$ is $P_{k+1}$-successful, which proves statement (ii).

Finally, since in $C'_j$ process $p_{k+1}$ is poised to execute operation op on $R^*$, and all other processes are poised to execute exactly the same step as in $C_i$, we have: In case op is a Write(), $WCov(C'_j, R^*) = WCov(C_i, R^*) \cup \{p_{k+1}\}$, and in case op is a CAS(), $CCov(C'_j, R^*) = CCov(C_i, R^*) \cup \{p_{k+1}\}$. All other sets $WCov(\cdot, \cdot)$ and $CCov(\cdot, \cdot)$ are the same for $C_i$ as for $C'_j$. Therefore, Cases (b) and (c) of Lemma 2 together with the inductive hypothesis (iii) immediately imply $|WCov(C'_j, R) \cap P_{k+1}| \leq t$ and $|CCov(C'_j, R) \cap P_{k+1}| \leq t$ for all objects R. This proves (iii) and completes the inductive step.

Parts (b) and (c) of Theorem 1 follow immediately from this lemma. (See Appendix B.2 for additional details.) Replacing each WeakWrite() with a LL()/SC() pair by process 0, and accommodating the definition of *p*-clean and *p*-dirty configurations, we obtain Corollary 1. (See Appendix B.3 for additional details.)
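The counting behind this step can be sketched as follows (in our own words; the full argument is in Appendix B.2, and the exact constants stated in Theorem 1 may differ). Apply Lemma 3 with $k = n-1$ and let $m$ be the number of base objects. By (ii), each of the $n-1$ processes in $P_{n-1}$ is poised in $C$ to execute a Write() or a CAS() on exactly one object, and by (iii) each object is covered by at most $t$ such Write() operations and at most $t$ such CAS() operations. Hence

$$n - 1 = \sum_{R} \bigl(|\operatorname{WCov}(C, R) \cap P_{n-1}| + |\operatorname{CCov}(C, R) \cap P_{n-1}|\bigr) \leq 2tm, \quad \text{so} \quad m \cdot t \geq \frac{n-1}{2}.$$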

## 3. Upper Bounds

#### 3.1. Constant-Time ABA-Detecting Registers from Registers

Figure 4 depicts an optimal linearizable implementation of an ABA-detecting register from n + 1 bounded registers with constant step-complexity. We use two more registers than needed according to the lower bound in Theorem 1 (a).

The main idea of the algorithm is similar to one used in the multi-layered construction of LL/SC/VL from CAS by Jayanti and Petrovic [15], which itself is a modified version of the implementation by Anderson and Moir [2].

Here, we briefly discuss the implementation. A complete correctness proof is provided in Appendix C. We use a shared bounded register X that stores a triple (x, p, s), where x is the value stored in the ABA-detecting register, $p \in \mathcal{P}$ is a process ID, and $s \in \{0, \ldots, 2n+1\}$ is a sequence number. We also use a shared *announce array* $A[0 \cdots n - 1]$, where only process q can write to A[q]. Each array entry A[q] stores a pair (p, s), where $p \in \mathcal{P}$ is a process ID and $s \in \{0, \ldots, 2n+1\}$ is a sequence number. Register X is initialized to $(\perp, \perp, \perp)$ and all entries of A are initialized to $(\perp, \perp)$.

In a DWrite(x) operation, the calling process p first determines a suitable sequence number, s, using the helper method GetSeq(), and writes the triple (x, p, s) to X (lines 26-27). Method GetSeq() ensures that the sequence number s it returns satisfies the following: If there is any point at which $X = (\cdot, p, s)$ and A[q] = (p, s) for some process q, then p will not use sequence number s again in any following DWrite() call, until $A[q] \neq (p, s)$. To achieve that, in a sequence of n consecutive GetSeq() calls process p scans through the entire announce array, reading one array entry with each GetSeq() call. It then returns a sequence number that p has not used in its preceding n DWrite() method calls, and which it has not found in any array entry of A[], when it read that entry last.
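As a quick sanity check (our own counting, not a statement taken from the proof in Appendix C), a suitable sequence number always exists in line 34 of Figure 4. The set $na$ holds at most one pair per index of $A$, that is, at most $n$ pairs, and $usedQ$ holds $n+1$ entries, so

$$\bigl|\{0, \ldots, 2n+1\} \setminus (\{i \mid (j, i) \in na\} \cup usedQ)\bigr| \geq (2n+2) - n - (n+1) = 1.$$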



Figure 3: An LL/SC/VL implementation from bounded CAS.

For its DRead() operation each process q uses a local variable, b, that indicates whether a DWrite() operation linearized already during q's previous DRead() operation after that operation's linearization point. Our algorithms ensure that if b = True at the beginning of a DRead() operation, then such a DWrite() has happened. In a DRead() operation, a process reads X twice to obtain the triples (x, p, s) in line 38, and (x', p', s') in line 41. Between those two reads q first reads the old value of A[q] into  $(r, s_r)$  (line 39), and then it announces the pair (p, s) by writing it to A[q] (line 40). Now  $(r, s_r)$  stores the "old" announcement from q's preceding DRead() operation, and A[q] stores the current one. In lines 42–45, q now decides the return value: If b = True or if  $(p, s) \neq (r, s_r)$ , then q returns (x, True), and otherwise (x, False). Moreover, in preparation for the next DRead(), if (x, p, s) = (x', p', s'), q sets b to False, and otherwise it sets it to True.

First suppose q reads two different triples from X in line 38 and line 41, i.e., $(x, p, s) \neq (x', p', s')$. Then the DRead() operation will linearize with the first read of X (line 38). We now know that the value of X has changed between the linearization point and the response of q's DRead(). Hence, q sets flag b to indicate that its next DRead() should return a pair ($\cdot$, True). If (x, p, s) = (x', p', s'), then, on the other hand, it is ensured that A[q] = (p, s) at the point when q read (x, p, s) from X in line 41. As explained above, in this case the pair (p, s) will not be used again in any following DWrite() operation, until q has replaced its announcement (p, s) with a new one. Hence, q resets b because in the following DRead() operation, q will be able to detect any DWrite() that has happened in between by comparing A[q] with the corresponding pair stored in X.

| <b>shared</b><br>Register $X = (\bot, \bot, \bot)$<br>Register $A[0 \dots n - 1] = ((\bot, \bot), \dots, (\bot, \bot))$ | <b>local</b> (to each process)<br>Boolean $b$ = False<br>Queue $usedQ[n+1] = (\bot, \dots, \bot)$<br>Set $na = \{\}$<br>int $c = 0$ |
|---|---|
| <b>Method</b> DWrite$_p(x)$<br>26 $s \leftarrow$ GetSeq()<br>27 $X$.Write$(x, p, s)$<br><br><b>Method</b> GetSeq$_p()$<br>28 $(r, s_r) \leftarrow A[c]$.Read()<br>29 <b>if</b> $r = p$ <b>then</b><br>30 &nbsp;&nbsp;&nbsp;$na \leftarrow (na \setminus \{(c, i) \mid i \in \mathbb{N}\}) \cup \{(c, s_r)\}$<br>31 <b>else</b><br>32 &nbsp;&nbsp;&nbsp;$na \leftarrow na \setminus \{(c, i) \mid i \in \mathbb{N}\}$<br>33 $c \leftarrow (c+1) \bmod n$<br>34 <b>choose arbitrary</b> $s \in \{0, \dots, 2n+1\} \setminus (\{i \mid (j, i) \in na\} \cup usedQ)$<br>35 $usedQ$.enq$(s)$<br>36 $usedQ$.deq()<br>37 <b>return</b> $s$ | <b>Method</b> DRead$_q()$<br>38 $(x, p, s) \leftarrow X$.Read()<br>39 $(r, s_r) \leftarrow A[q]$.Read()<br>40 $A[q]$.Write$(p, s)$<br>41 $(x', p', s') \leftarrow X$.Read()<br>42 <b>if</b> $(p, s) = (r, s_r)$ <b>then</b><br>43 &nbsp;&nbsp;&nbsp;$ret \leftarrow (x, b)$<br>44 <b>else</b><br>45 &nbsp;&nbsp;&nbsp;$ret \leftarrow (x, \text{True})$<br>46 <b>if</b> $(x, p, s) = (x', p', s')$ <b>then</b><br>47 &nbsp;&nbsp;&nbsp;$b \leftarrow$ False<br>48 <b>else</b><br>49 &nbsp;&nbsp;&nbsp;$b \leftarrow$ True<br>50 <b>return</b> $ret$ |

Figure 4: An ABA-detecting register implemented from bounded registers.
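The pseudo-code of Figure 4 can be transcribed into a short Python sketch. This is a sequential model only, under our own assumptions: each method runs atomically, so interleavings (and hence the case in which the two reads of X differ) are not exercised; `None` stands for $\perp$; the set $na$ is stored as a dictionary from array index to sequence number; and the arbitrary choice in line 34 is resolved by taking the smallest free number. The class and method names are hypothetical.

```python
from collections import deque

BOT = None  # stands for the initial value (bottom)


class ABADetectingRegister:
    """Sequential transcription of Figure 4 for n processes (illustrative)."""

    def __init__(self, n):
        self.n = n
        self.X = (BOT, BOT, BOT)              # shared register X: (x, p, s)
        self.A = [(BOT, BOT)] * n             # announce array A[0..n-1]
        # Local state, one copy per process.
        self.b = [False] * n
        self.usedQ = [deque([BOT] * (n + 1)) for _ in range(n)]
        self.na = [dict() for _ in range(n)]  # index c -> announced seq. number
        self.c = [0] * n

    def _get_seq(self, p):
        c = self.c[p]
        r, s_r = self.A[c]                    # line 28
        if r == p:                            # lines 29-30
            self.na[p][c] = s_r
        else:                                 # lines 31-32
            self.na[p].pop(c, None)
        self.c[p] = (c + 1) % self.n          # line 33
        blocked = set(self.na[p].values()) | set(self.usedQ[p])
        # line 34: any free number works; we take the smallest one
        s = min(set(range(2 * self.n + 2)) - blocked)
        self.usedQ[p].append(s)               # line 35
        self.usedQ[p].popleft()               # line 36
        return s                              # line 37

    def dwrite(self, p, x):
        s = self._get_seq(p)                  # line 26
        self.X = (x, p, s)                    # line 27

    def dread(self, q):
        x, p, s = self.X                      # line 38
        old = self.A[q]                       # line 39
        self.A[q] = (p, s)                    # line 40
        second = self.X                       # line 41
        changed = self.b[q] or (p, s) != old  # lines 42-45
        self.b[q] = (x, p, s) != second       # lines 46-49
        return x, changed                     # line 50
```

In this model, writing "a", then "b", then "a" again between two reads by the same process still makes the second read report a change, because the announced pair (p, s) differs even though the value does not.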

#### 3.2. LL/SC/VL from a Single Bounded CAS

We now briefly sketch our wait-free implementation of LL/SC/VL from a single bounded CAS object. The implementation has O(n) step complexity, and thus, by Corollary 1, is optimal. The pseudo-code is presented in Figure 3 and correctness proofs can be found in Appendix D.

In a CAS object X, we store a pair (x, a), where x represents the value of the implemented LL/SC/VL object, and a is an n-bit string. The p-th bit of a is used to indicate whether an SC() operation linearized since p's last LL() (the bit is usually set in this case). As in the previous algorithm, we use a local variable b for each process p. In an LL() call, a process p tries to reset its bit (the p-th bit of the second component) of X. As we explain below, this may fail, but only if an SC() sc linearizes during p's attempts to reset that bit. If that happens, p sets the flag b and its LL() linearizes before sc. Thus, in a subsequent SC() or VL(), p determines from the set flag b that it does not have a valid link, and that SC() or VL() can fail, even though p's bit in X is not set.

More precisely, in an LL() method call, process p reads the pair (x, a) from X (line 14) and checks whether its bit in a is set. If not, in lines 16–17 it simply resets b (because in subsequent SC() or VL() calls p's bit in X will indicate whether p has a valid link or not) and returns x. That LL() operation linearizes with the Read() of X in line 14. Now suppose p's bit in X is set. Then p tries to reset that bit, using a CAS() operation on X. However, that CAS() may fail because of some other process' successful CAS() during an LL() or SC() call. Therefore, p repeatedly reads X followed by a CAS() to reset its bit in the second component of X, until its CAS() operation succeeds, or until it has failed n times (lines 20–21). If a CAS() succeeds, p resets b and returns the first component of X that it read just before its last, successful CAS() attempt (lines 22-23); the LL() linearizes with that CAS(), and since p's bit in X is now reset, in the next SC() or VL() operation p can use its bit in X to determine whether an SC() linearized since the linearization point of the current LL(). If p's CAS() fails n times, then X must have changed n times since p's first Read() of X. We argue that then at least one such change must be due to a CAS() operation during some process' SC(): Suppose not. Then X must have changed at least n times, and every time it must have changed because of a CAS() executed in an LL() operation. But this is not possible, because each time such a CAS() succeeds, one of the bits in the second part of X changes from 1 to 0, and p's bit does not change at all. We conclude that at least once, while p has been trying to reset its bit in X, a successful CAS() on X must have occurred during an SC() operation. As we discuss below, this means that a successful SC() linearized. Hence, in this case p can set its flag b to True (line 24), which guarantees that p's next SC() or VL() will fail, and return in line 25 the value x it read at the very beginning from X. The linearization point of that LL() is the Read() of X in line 14.

In an SC(y) operation a process p first checks flag b, and if it is set, p immediately returns False; this indicates that an SC() linearized during p's last LL() but after the linearization point of that LL(). If b is not set, then p reads X to determine whether its bit in X is set, and if yes, it can also return False (lines 3–5), because this indicates that some other SC() has linearized since p's last LL(). If p's bit in X is not set, then p tries to write $(y, 2^n - 1)$ into X using a CAS() operation (line 6). If that CAS() succeeds, as a result the value of the LL/SC/VL object changes to y, and the bits of all processes are now set in X. Hence, p's SC(y) linearizes with that successful CAS(), and p returns True (line 7). If the CAS() fails, then p repeats up to n times, until either it finds that its bit in X is set (and thus some other process' SC() succeeded), or its own CAS() succeeds. If p's CAS() fails n times, then for the same reasons as explained earlier, we know that some process' SC() must have linearized during p's ongoing SC(y) operation, and thus p can return False (the unsuccessful SC() linearizes with its response).

Operation VL() is very simple: A process simply checks whether flag b or its bit in X is set, and if yes, it returns False, otherwise it returns True.
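The three operations described above can be sketched in Python as follows. This is a reconstruction from the prose of this section only, not a transcription of the pseudo-code in Figure 3, and it rests on several assumptions of ours: operations run sequentially, so the retry loops never actually iterate more than once; the bit string a is stored as an integer whose p-th bit belongs to process p; all bits are initially set, so that an SC() or VL() without a preceding LL() fails; and the names `CASObject` and `LLSCVL` are hypothetical.

```python
class CASObject:
    """A writable CAS object holding one value (sequential model)."""

    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value

    def cas(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False


class LLSCVL:
    """LL/SC/VL for n processes on top of a single CAS object X = (x, a)."""

    def __init__(self, n, initial=None):
        self.n = n
        self.full = (1 << n) - 1                # bit string with all n bits set
        self.X = CASObject((initial, self.full))  # assumed initial state
        self.b = [False] * n                    # local flag b of each process

    def ll(self, p):
        mask = 1 << p
        first = self.X.read()
        if not (first[1] & mask):               # bit clear: X tracks the link
            self.b[p] = False
            return first[0]
        cur = first
        for _ in range(self.n):                 # at most n CAS attempts
            if self.X.cas(cur, (cur[0], cur[1] & ~mask)):
                self.b[p] = False               # bit reset by this CAS
                return cur[0]
            cur = self.X.read()
        self.b[p] = True                        # some SC() linearized meanwhile
        return first[0]                         # value from the first read

    def sc(self, p, y):
        if self.b[p]:
            return False
        mask = 1 << p
        for _ in range(self.n):
            cur = self.X.read()
            if cur[1] & mask:                   # another SC() has linearized
                return False
            if self.X.cas(cur, (y, self.full)):
                return True                     # all bits set: links invalidated
        return False

    def vl(self, p):
        return not self.b[p] and not (self.X.read()[1] & (1 << p))
```

A successful SC() sets every bit, including the caller's own, which is why each process must perform a new LL() before its next SC() can succeed.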

## References

- [1] Z. Aghazadeh, W. Golab, and P. Woelfel. Making objects writable. In Proc. of 33rd PODC, pages 385–395, 2014.
- [2] J. H. Anderson and M. Moir. Universal constructions for multi-object operations. In Proc. of 14th PODC, pages 184–193, 1995.
- [3] H. Attiya and F. Ellen. *Impossibility Results for Distributed Computing*. Synthesis Lectures on Distributed Computing Theory. Morgan & Claypool Publishers, 2014.
- [4] J. Burns and N. Lynch. Bounds on shared memory for mutual exclusion. Information and Computation, 107(2):171–184, 1993.

- [5] S. Doherty, M. Herlihy, V. Luchangco, and M. Moir. Bringing practical lock-free synchronization to 64-bit applications. In Proc. of 23rd PODC, pages 31–39, 2004.
- [6] F. Ellen, P. Fatourou, and E. Ruppert. The space complexity of unbounded timestamps. *Distr. Comp.*, 21(2):103–115, 2008.
- [7] F. E. Fich, D. Hendler, and N. Shavit. On the inherent weakness of conditional primitives. Distr. Comp., 18(4):267–277, 2006.
- [8] F. E. Fich, M. Herlihy, and N. Shavit. On the space complexity of randomized synchronization. J. of the ACM, 45(5):843–862, 1998.
- [9] M. Helmi, L. Higham, E. Pacheco, and P. Woelfel. The space complexity of long-lived and one-shot timestamp implementations. J. of the ACM, 61(1):7, 2014.
- [10] D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. J. Parallel Distrib. Comp., 70(1):1–12, 2010.
- [11] M. Herlihy. Wait-free synchronization. ACM Trans. Program. Lang. Syst., 13(1):124–149, 1991.
- [12] M. Herlihy, V. Luchangco, and M. Moir. The repeat offender problem: A mechanism for supporting dynamic-sized, lock-free data structures. In *Proc. of 16th DISC*, pages 339–353, 2002.
- [13] M. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst., 12(3), 1990.
- [14] IBM. IBM system/370 extended architecture, principles of operation. Technical report, 1983. Publication No. SA22-7085.
- [15] P. Jayanti and S. Petrovic. Efficient and practical constructions of LL/SC variables. In Proc. of 22nd PODC, pages 285–294, 2003.
- [16] P. Jayanti and S. Petrovic. Efficiently implementing a large number of LL/SC objects. In Proc. of 9th OPODIS, pages 17–31, 2006.
- [17] P. Jayanti, K. Tan, and S. Toueg. Time and space lower bounds for nonblocking implementations. SIAM J. on Comp., 30(2):438–456, 2000.
- [18] E. Ladan-Mozes and N. Shavit. An optimistic approach to lock-free FIFO queues. Distr. Comp., 20(5):323–341, 2008.
- [19] M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proc. of 14th SPAA, pages 73–82, 2002.
- [20] M. Michael. ABA prevention using single-word instructions. Technical report, IBM T. J. Watson Research Center, 2004.

- [21] M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems, 15(6):491–504, 2004.
- [22] M. Michael. Practical lock-free and wait-free LL/SC/VL implementations using 64-bit CAS. In Proc. of 18th DISC, pages 144–158, 2004.
- [23] M. Michael. Scalable lock-free dynamic memory allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35–46, 2004.
- [24] M. Michael and M. L. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In Proc. of 15th PODC, pages 267–275, 1996.
- [25] M. Michael and M. L. Scott. Nonblocking algorithms and preemption-safe locking on multiprogrammed shared memory multiprocessors. J. Parallel Distrib. Comp., 51(1):1–26, 1998.
- [26] M. Moir. Practical implementations of non-blocking synchronization primitives. In Proc. of 16th PODC, pages 219–228, 1997.
- [27] S. Prakash, Y. Lee, and T. Johnson. A non-blocking algorithm for shared queues using compare-and-swap. In Proc. of ICPP, pages 68–75, 1991.
- [28] C. Shann, T. Huang, and C. Chen. A practical nonblocking queue algorithm using compare-and-swap. In Proc. of 7th ICPADS, pages 470–475, 2000.
- [29] J. M. Stone. A simple and correct shared-queue algorithm using compare-and-swap. In Proc. of SC, pages 495–504, 1990.
- [30] E. Styer and G. Peterson. Tight bounds for shared memory symmetric mutual exclusion problems. In Proc. of 8th PODC, pages 177–191, 1989.
- [31] P. Tsigas and Y. Zhang. A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems. In Proc. of 13th SPAA, pages 134–143, 2001.
- [32] J. D. Valois. Lock-free linked lists using compare-and-swap. In Proc. of 14th PODC, pages 214–222, 1995.

## A. Implementation of ABA-Detecting Registers from LL/SC/VL

**Theorem 4.** There is an implementation of an ABA-Detecting register from a single LL/SC/VL object, such that each DRead() and DWrite() operation takes only two shared memory steps.

*Proof.* It suffices to show that the implementation in Figure 5 is linearizable.

Consider a history H on the implemented ABA-Detecting register. We assume w.l.o.g. that all operations in H complete.

We say a VL() operation *succeeds*, if it returns True. Note that we assumed w.l.o.g. (see the description of Figure 5) that the first X.VL() call by a process q succeeds, even if q has not called X.LL(), provided that no X.SC() call has been executed, either. Therefore, for the purpose of this proof we may assume w.l.o.g. that the history H starts with n complete X.LL() operations, one by each process. We say a process p has a *valid link* on X, whenever no successful X.SC() has occurred since p's last X.LL() operation.

For each DWrite() and DRead() operation op by process p respectively q, we define a point in time,  $\ell(op)$ , as follows. If op is a DWrite() by process p, and p's SC() in line 52 is successful, then  $\ell(op)$  is the point when that successful SC() gets executed. If the SC() is unsuccessful, then  $\ell(op)$  is the point immediately before the first successful SC() that gets executed after p's LL() in line 51 (such an SC() must occur before p's SC() in line 52, because that SC() by p fails). If op is a DRead() by q, then  $\ell(op)$  is the point of the last shared memory operation executed by q during op (which is either X.VL() in line 53 or X.LL() in line 54).

We prove below that  $\ell()$  maps each operation op to a linearization point of op; therefore, we say that op linearizes at  $\ell(op)$ . Each point  $\ell(op)$  occurs between the invocation and the response of op. It suffices to show that the history obtained by ordering all operations in H by  $\ell(op)$  is valid. Note that every DWrite() operation linearizes either at or immediately before the point of some successful SC(). Therefore, at any point the value of X is equal to the value of the DWrite() operation that linearized last.

Now consider a DRead() operation op by process q. Initially, the value of old equals the value of X, and q has a valid link on X. It follows from the structure of the DRead() operation that the following invariant is maintained: At any point throughout H, q has a valid link on X if and only if no successful X.SC() has been executed on X since q's last X.LL() or its last successful X.VL(), whichever came later. Since a DWrite() operation linearizes at some point t if and only if a successful X.SC() operation is executed at point t, and a DRead() operation linearizes with

shared LL/SC Object  $X = \bot$

Method DWrite$_p(x)$

51 X.LL()

52 X.SC(x)

local  $old = \bot$

Method DRead$_q()$

53 if X.VL() then return $(old, \texttt{False})$

54 else $old$ := X.LL(); return $(old, \texttt{True})$

Figure 5: Implementation of an ABA-detecting register from LL/SC/VL. For the ease of description but w.l.o.g. we assume that, even if q has not called X.LL(), an X.VL() call by process q returns **True** as long as no successful X.SC() has been executed.

its successful X.VL() or, in case of an unsuccessful X.VL(), with its X.LL(), we obtain: Process q has a valid link on X if and only if no DWrite() has linearized since q's last X.DRead() operation linearized (or since the beginning of the history, if none of q's DRead() operations have linearized, yet).

Now suppose op linearizes with the X.VL() operation in line 53, i.e., that VL() operation is successful. Then q returns (old, False). The second component, False, is correct because at  $\ell(op)$  process q has a valid link on X, i.e., no DWrite() operation has linearized since the linearization point of q's preceding DRead() operation (or since the beginning of the execution, if op is q's first DRead() operation). The first component, old, is also correct, because q has a valid link on X, which means that the value of X cannot have changed since q's last LL() in which it obtained the value of old from X (or since the beginning of the execution, when  $X = old = \bot$ ).

Finally, suppose *op* linearizes with the X.LL() operation in line 54, i.e., the preceding X.VL() operation in line 53 fails. Then *op* returns in line 54. Moreover, *q* has no valid link at the point of that X.VL(), and thus also not immediately before  $\ell(op)$ . Hence, a DWrite() operation has linearized since the linearization point of *q*'s preceding DRead() operation (or since the beginning of the execution), and so the second component of the return value, True, is correct. The first component of the return value is the value of X at  $\ell(op)$ , which is also correct.
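The construction analyzed in this proof can be exercised in a small sequential model. The sketch below is illustrative only: the class names `LLSCVL` and `ABADetectingRegister` are hypothetical, the LL/SC/VL object is simulated with per-process validity flags instead of real primitives, operations run one at a time (so each is trivially atomic), and the body of DRead() is taken from the description in the proof (return `(old, False)` after a successful VL() in line 53, otherwise perform LL() in line 54 and return `(old, True)`).

```python
class LLSCVL:
    """Toy sequential model of an LL/SC/VL object for n processes."""

    def __init__(self, n, initial=None):
        self.value = initial
        # Assumption from the Figure 5 caption: before any successful SC(),
        # every process is treated as if it held a valid link.
        self.valid = [True] * n

    def ll(self, p):
        self.valid[p] = True
        return self.value

    def vl(self, p):
        return self.valid[p]

    def sc(self, p, x):
        if not self.valid[p]:
            return False
        self.value = x
        # A successful SC() invalidates every link, including p's own.
        self.valid = [False] * len(self.valid)
        return True


class ABADetectingRegister:
    def __init__(self, n):
        self.X = LLSCVL(n)          # None plays the role of the initial value
        self.old = [None] * n       # local variable old of each process

    def dwrite(self, p, x):
        self.X.ll(p)                # line 51
        self.X.sc(p, x)             # line 52

    def dread(self, q):
        if self.X.vl(q):            # line 53
            return (self.old[q], False)
        self.old[q] = self.X.ll(q)  # line 54
        return (self.old[q], True)


reg = ABADetectingRegister(n=2)
reg.dwrite(0, "a")
first = reg.dread(1)    # sees the write of "a"
reg.dwrite(0, "b")
reg.dwrite(0, "a")      # value is "a" again: an ABA from process 1's view
second = reg.dread(1)
third = reg.dread(1)    # no write since the previous read
print(first, second, third)
```

In this run the second read reports a change even though the stored value is again "a", while the third read reports no change, which matches the two cases distinguished in the proof.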

## B. Additional Details on the Lower Bound Proofs

## B.1. Proof of Observation 1

*Proof.* For the purpose of a contradiction, suppose  $C_1 \sim_p C_2$ . Let  $\alpha_i, i \in \{1, 2\}$ , be the schedules such that  $C_{init} \xrightarrow{\alpha_i} C_i$ , and

- in  $E_1 := \text{Exec}(C_{init}, \alpha_1) p$  executes at least one complete WeakRead(), and the last one,  $r^*$ , happens after any WeakWrite(); and
- in  $E_2 := \text{Exec}(C_{init}, \alpha_2)$  there is a complete WeakWrite()  $w^*$  (by process 0) that overlaps with no WeakRead() by p, and p invokes no WeakRead() after  $w^*$ .

By the nondeterministic solo-termination property, there exists a *p*-only schedule  $\lambda$ , such that in  $\text{Exec}(C_2, \lambda)$  process *p* completes exactly one WeakRead() method call *r*. Then in  $E_2 \circ \text{Exec}(C_2, \lambda)$  the WeakWrite()  $w^*$  happens before *r* and after any preceding WeakRead() by *p*. Therefore, by the specification of WeakWrite() and WeakRead(), operation *r* returns True.

Since  $C_1 \sim_p C_2$ , process p also completes its WeakRead() r with return value True during  $\operatorname{Exec}(C_1, \lambda)$ . But since  $C_1$  is p-clean, there is now a complete WeakRead() operation  $r^*$  by process p that precedes r in  $\operatorname{Exec}(C_{init}, \alpha_1 \circ \lambda)$ , and any WeakWrite() operation happens before  $r^*$ . By the specification of WeakWrite() and WeakRead(), operation r returns False—a contradiction.

#### B.2. Additional Details on the Proof of Theorem 1

Parts (b) and (c) of Theorem 1 follow almost immediately from Lemma 3, as follows.

By Lemma 3, for  $P = \{1, \ldots, n-1\}$ , there is a reachable configuration C which has a P-successful schedule, and for every object R,  $|WCov(C, R) \cap P| \leq t$  and  $|CCov(C, R) \cap P| \leq t$ . Hence, if the implementation uses m writable CAS objects, then in C at most 2t processes are poised to access each CAS object, and thus

$$2mt \ge n-1.$$

Similarly, if each of the m base objects supports, in addition to Read(), only one of the two operations, CAS() or Write(), then in C at most t processes are poised to access each object, and so

$$mt \ge n-1.$$

The result now follows immediately by solving for m.
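Written out, solving the two inequalities above for m gives the following (a restatement only, with m, t, and n as used in this subsection):

$$2mt \ge n-1 \;\Longrightarrow\; m \ge \frac{n-1}{2t}, \qquad mt \ge n-1 \;\Longrightarrow\; m \ge \frac{n-1}{t}.$$

For example, for $t = 1$ the first inequality forces $m \ge \lceil (n-1)/2 \rceil$ writable CAS objects, and the second forces $m \ge n-1$ base objects.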

#### B.3. Lower Bounds for Implementations of LL/SC

We can easily modify the lower bounds for implementations of ABA-detecting registers to obtain the same lower bounds for implementations of LL/SC objects from bounded CAS objects and registers.

Consider a linearizable implementation of the methods SC() and LL(). Given a process  $p \in \mathcal{P} \setminus \{0\}$ , we define *p*-clean and *p*-dirty configurations in almost the same way as for WeakRead() and WeakWrite(): Configuration *C* is *p*-clean, if there exists a schedule  $\alpha$ ,  $C_{init} \stackrel{\alpha}{\to} C$ , such that  $\text{Exec}(C_{init}, \alpha)$  contains a complete LL() operation  $r^*$  by *p*, and any (successful or unsuccessful) SC() happens before  $r^*$ . Configuration *C* is *p*-dirty, if there exists a schedule  $\alpha$ ,  $C_{init} \stackrel{\alpha}{\to} C$ , such that  $\text{Exec}(C_{init}, \alpha)$  contains a successful SC()  $w^*$ , and no LL() by *p* is pending at any point after  $w^*$  has been invoked.

We observe analogously to Observation 1:

**Observation 2.** Suppose the LL() and SC() methods satisfy nondeterministic solo-termination. For any process  $p \in \mathcal{P} \setminus \{0\}$  and any two reachable configurations  $C_1, C_2$ , if  $C_1$  is p-clean and  $C_2$  is p-dirty, then  $C_1 \not\sim_p C_2$ .

*Proof.* For the purpose of a contradiction, suppose  $C_1 \sim_p C_2$ . Let  $\alpha_i, i \in \{1, 2\}$ , be the schedules such that  $C_{init} \xrightarrow{\alpha_i} C_i$ , and

- in  $E_1 := \text{Exec}(C_{init}, \alpha_1) p$  executes at least one complete LL(), and the last one,  $r^*$ , happens after any SC(); and
- in  $E_2 := \text{Exec}(C_{init}, \alpha_2)$  there is a complete successful SC()  $w^*$  that overlaps with no LL() by p, and p invokes no LL() after  $w^*$ .

Then process p must be idle in configuration  $C_2$ , and thus also in configuration  $C_1$ . The implementation of LL()/SC() must remain correct, if starting in configuration  $C_1$  respectively  $C_2$  process p executes a SC(y) call, where y is an arbitrary value. In a long enough p-only execution, that SC(y) call must complete, but since  $C_1$  and  $C_2$  are indistinguishable, the resulting p-only executions starting in  $C_1$  respectively  $C_2$  must be the same. I.e., there is a p-only execution E such that  $E_1 \circ E$  and  $E_2 \circ E$  are also executions, and in E process p completes exactly one SC() method  $s^*$ . In  $E_1 \circ E$ , all SC() operations except for  $s^*$  terminate before p's last LL(),  $r^*$ , and that LL() is followed by p's SC(),  $s^*$ . By the semantics of LL/SC,  $s^*$  succeeds. On the other hand, in  $E_2 \circ E$  a successful SC()  $w^*$  happens before p's SC()  $s^*$  and no LL() by p is pending at any point after the invocation of  $w^*$ . Hence,  $s^*$  fails, which is a contradiction.

Now, in the proofs of Lemmas 1 and 3, we can simply replace every occurrence of WeakRead() with LL(), and every occurrence of WeakWrite() with an LL() followed by an SC(). With those replacements, every step of the proofs of Lemmas 1 and 3 holds vacuously. (Observe that in the execution constructed in the original proofs, only process 0 calls WeakWrite(), so if we make the replacements as described, each SC() must succeed.) This yields Corollary 1.

## C. Proof of Theorem 3

To prove Theorem 3 it is enough to show that the implementation of the ABA-detecting register in Figure 4 is linearizable.

In every line of the code, at most one shared memory operation is executed. Consider some history H in which processes execute DRead() and DWrite() operations. For any operation m by process p and any line number k, we let  $t_k^m$  denote the point in time when p executes its shared memory operation in line k during m. Further, inv(m) and rsp(m) denote the points in time of the invocation respectively response of m. We say operation m completes in some interval [t, t'], if  $[inv(m), rsp(m)] \subseteq [t, t']$ .

Consider some history H on the ABA-detecting register. We define the linearization point  $\ell(m)$  of each operation m as follows. A DWrite() operation dw in H linearizes when the value of X is updated in line 27 of dw (i.e.  $\ell(dw) = t_{27}^{dw}$ ). For a DRead() operation dr by some process q, we define  $\ell(dr) = t_{38}^{dr}$  if b = True at rsp(dr), and otherwise  $\ell(dr) = t_{41}^{dr}$ . Clearly the linearization point of each operation is between the invocation and response of the operation. It remains to show that the history  $S_H$  obtained by ordering all operations by their linearization points is valid. For that, we first prove the following auxiliary claims.

**Claim 1.** For every complete DRead() operation dr in which process q reads (x, p, s) in line 38 and (x', p', s') in line 41, if b = True at rsp(dr), then some process writes to X during  $[\ell(dr), rsp(dr)]$ , and otherwise at  $\ell(dr)$ , we have A[q] = (p, s) = (p', s') and (x, p, s) = (x', p', s').

*Proof.* First suppose b = True at rsp(dr), and thus  $\ell(dr) = t_{38}^{dr}$ . This implies that q executes line 49 of dr, and so  $(x, p, s) \neq (x', p', s')$ . I.e., q reads different triples from X in line 38 and in line 41. Therefore, a process writes to register X during  $[t_{38}^{dr}, t_{41}^{dr}] \subseteq [\ell(dr), rsp(dr)]$ .

Now suppose b = False at rsp(dr), and thus  $\ell(dr) = t_{41}^{dr}$ . It also implies that q executes line 47 of dr and hence (x, p, s) = (x', p', s'). Process q writes the pair (p, s) to A[q] at  $t_{40}^{dr}$  and it does not change it before  $t_{41}^{dr}$ . Thus, A[q] = (p, s) = (p', s') at  $t_{41}^{dr} = \ell(dr)$ .

**Claim 2.** Consider two GetSeq() operations  $gs_1$  and  $gs_2$  by the same process p, which both return the same value s. Then p completes at least n GetSeq() calls between  $gs_1$  and  $gs_2$ .

*Proof.* This follows from the fact that before a process returns a sequence number s in a GetSeq() call, it enqueues it in line 35 in the queue usedQ of length n + 1. After that, according to line 34, it does not choose s again until s has been removed from the queue, and in every GetSeq() call only one element gets dequeued (in line 36).
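The reuse rule behind this argument can be tried out in a toy model. The sketch below is not the GetSeq() method of Figure 4: the class `SeqRecycler`, the helper `min_calls_between_repeats`, the domain of n + 2 sequence numbers, and the placeholder initial queue contents are all assumptions made for illustration, and the set na of announced pairs is ignored. It only models the three steps the proof refers to: pick a number not in the queue, enqueue it, and dequeue exactly one element.

```python
from collections import deque


class SeqRecycler:
    """Toy model of the sequence-number reuse rule used in the proof of Claim 2."""

    def __init__(self, n):
        self.n = n
        # Queue of length n + 1, initially filled with placeholders (assumption).
        self.used_q = deque([None] * (n + 1))
        # Hypothetical domain: n + 2 values, so at least one is always free.
        self.domain = range(n + 2)

    def get_seq(self):
        # Pick a number that is not currently in the queue (cf. line 34).
        s = next(v for v in self.domain if v not in self.used_q)
        self.used_q.append(s)    # enqueue the chosen number (cf. line 35)
        self.used_q.popleft()    # dequeue exactly one element (cf. line 36)
        return s


def min_calls_between_repeats(returns):
    """Smallest number of calls strictly between two calls returning the same value."""
    last_seen, best = {}, None
    for i, s in enumerate(returns):
        if s in last_seen:
            gap = i - last_seen[s] - 1
            best = gap if best is None else min(best, gap)
        last_seen[s] = i
    return best


n = 4
recycler = SeqRecycler(n)
returns = [recycler.get_seq() for _ in range(50)]
print(min_calls_between_repeats(returns))
```

Because a chosen number sits at the back of a queue that loses only one element per call, it cannot be chosen again until enough further calls have drained the queue, which is the counting step of the proof.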

**Claim 3.** Suppose X = (x, p, s) at some point t for some triple (x, p, s), and A[q] = (p, s) throughout [t, t'], for some  $t' \ge t$ . Then, p does not write (x', p, s) into X during (t, t'], for any value of x'.

*Proof.* As X = (x, p, s) at t, p must have written that triple to X before t in a DWrite(x) call. Let gs be the GetSeq() operation that p executed during that DWrite(x) call. Then gs responds before t and returns s, and p can complete at most one additional GetSeq() operation after gs and before t (this may happen during its following DWrite() call, just before it writes to X again). Note that not more than one GetSeq() operation can be invoked after gs and before t.

Suppose for the sake of contradiction that p writes (x', p, s) to X during (t, t'] in line 27 of some DWrite() operation, for some value x'. Thus, p completes a GetSeq() operation (in line 26 of the same DWrite() operation) during [rsp(gs), t'], such that it returns s. Let gs' be the first such GetSeq() operation.

By Claim 2, p completes at least n GetSeq() operations  $gs_1, \ldots, gs_n$ , executed in the same order, during [rsp(gs), inv(gs')]. As at most one GetSeq() operation can respond after gs and before t, only  $gs_1$  can be invoked and respond during [rsp(gs), t]. Thus,

$$gs_2, \ldots, gs_n, gs' \text{ all complete, in the same order, during } [t, t']. \tag{13}$$

As p increments its local variable c by 1 modulo n during each GetSeq() operation, c = q at the invocation of some GetSeq() operation  $gs'' \in \{gs_2, \ldots, gs_n, gs'\}$ . By the assumption, A[q] = (p, s) throughout [t, t'], and therefore, by Eq. (13), A[q] = (p, s) throughout the execution of gs''. Thus, p reads (p, s) from A[q] and adds (q, s) to its set na in lines 28-30 of gs''. Process p only removes (q, s) from na when it reads a new pair (p, s'), for  $s' \neq s$ , from A[q] (lines 29-32); hence, (q, s) remains in p's set na until some time after the value stored in A[q] changes, which is after t'. Therefore, no GetSeq() operation by p that executes line 34 during  $[t_{30}^{gs''}, t']$  returns s. But by Eq. (13),  $t_{34}^{gs'} \in [t_{30}^{gs''}, t']$ , because  $gs'' \in \{gs_2, \ldots, gs_n, gs'\}$  and gs' is the last operation executed in this set. Hence, gs' does not return s, which contradicts the assumption.

**Claim 4.** Consider two consecutive DRead() operations  $dr_1$  and  $dr_2$  by the same process q. Suppose b = False at  $inv(dr_2)$ , and the if-condition in line 42 of  $dr_2$  evaluates to True. Then no process writes to X during  $[\ell(dr_1), \ell(dr_2)]$ .

*Proof.* Let  $(x_1, p_1, s_1)$  and  $(x_2, p_2, s_2)$  be the triples that process q reads from X in line 38 of  $dr_1$  respectively  $dr_2$ . As the value of A[q] is only modified in line 40 of a DRead() operation by q,  $A[q] = (p_1, s_1)$  throughout  $(t_{40}^{dr_1}, t_{39}^{dr_2}]$ , and so  $(r, s_r) = (p_1, s_1)$ . Since the if-condition in line 42 of  $dr_2$  evaluates to True,  $(p_1, s_1) = (p_2, s_2)$ . So let  $p = p_1 = p_2$  and  $s = s_1 = s_2$ . Then process q writes (p, s) to A[q] at both  $t_{40}^{dr_1}$  and  $t_{40}^{dr_2}$ , and since A[q] is not changed elsewhere, we have

$$A[q] = (p, s) \text{ throughout } [t_{40}^{dr_1}, rsp(dr_2)]. \tag{14}$$

Suppose for the sake of contradiction that some process writes to X during  $[\ell(dr_1), \ell(dr_2)]$ . As q reads  $(x_2, p, s)$  from X at  $t_{38}^{dr_2}$ , the last write to X during interval  $[\ell(dr_1), t_{38}^{dr_2}]$  must be by p and it must write triple  $(x_2, p, s)$  to X. Process q does not change the value stored in b during  $[rsp(dr_1), inv(dr_2)]$ , and by the assumption b = False at  $inv(dr_2)$ , thus,

$$b = \texttt{False} \text{ at } rsp(dr_1). \tag{15}$$

Thus, by the definition of  $\ell$  we have

$$\ell(dr_1) = t_{41}^{dr_1}.\tag{16}$$

Eq. (15) also implies that the if-condition in line 46 of  $dr_1$  evaluates to **True**. Thus, q reads the triple  $(x_1, p, s)$  from X at  $t_{41}^{dr_1}$ . By (14), we have A[q] = (p, s) throughout  $[t_{41}^{dr_1}, rsp(dr_2)]$ , and so by Claim 3, p does not write (x', p, s) to X, for any value of x', during  $[t_{41}^{dr_1}, rsp(dr_2)] = [\ell(dr_1), rsp(dr_2)]$ , and so not during interval  $[\ell(dr_1), t_{38}^{dr_2}]$ , a contradiction.

**Claim 5.** Consider two consecutive DRead() operations  $dr_1$  and  $dr_2$  by some process q. Suppose the if-condition in line 42 of  $dr_2$  evaluates to False. Then a process writes to X during  $[\ell(dr_1), \ell(dr_2)]$ .

*Proof.* Suppose process q reads some values  $(x_1, p_1, s_1)$  and  $(x_2, p_2, s_2)$  from X in line 38 of  $dr_1$  respectively  $dr_2$ . Register A[q] can only be modified by q and only in line 40 of a DRead() operation, so  $A[q] = (p_1, s_1)$  throughout  $(t_{40}^{dr_1}, t_{39}^{dr_2}]$ , and so  $(r, s_r) = (p_1, s_1)$ . By the assumption that the if-condition in line 42 of  $dr_2$  evaluates to False,  $(p_1, s_1) \neq (p_2, s_2)$ . Hence, the value  $(x_2, p_2, s_2)$  gets written to X during  $(t_{38}^{dr_1}, t_{38}^{dr_2})$ . If  $\ell(dr_1) = t_{38}^{dr_1}$ , then the claim follows, as  $\ell(dr_2)$  is either  $t_{38}^{dr_2}$  or  $t_{41}^{dr_2}$ .

Now suppose  $\ell(dr_1) = t_{41}^{dr_1}$ . By the definition of  $\ell$ , b = False at  $rsp(dr_1)$ . Hence, q executes line 47, and thus it reads the same triple  $(x_1, p_1, s_1)$  from X in line 41, as it did in line 38 of  $dr_1$ . Suppose for the sake of contradiction that no process writes to X during  $[\ell(dr_1), \ell(dr_2)] = [t_{41}^{dr_1}, \ell(dr_2)] \supseteq [t_{41}^{dr_1}, t_{38}^{dr_2}]$ . Then X remains unchanged throughout that interval, and in particular at  $t_{38}^{dr_2}$  process q reads  $(x_1, p_1, s_1)$  from X, and so  $(x_1, p_1, s_1) = (x_2, p_2, s_2)$ , a contradiction.

Now, we prove that sequential history  $S_H$  is valid. Consider the first DRead() dr by some process q. If no DWrite() linearizes before  $t_{41}^{dr}$ , then X has its initial value,  $(\bot, \bot, \bot)$ , from the beginning of the execution until  $t_{41}^{dr}$ . Hence, q reads that triple from X in line 38 and also in line 41, and so the if-condition in line 46 evaluates to True, and dr returns  $(\bot, \texttt{False})$ . Since  $\ell(dr)$  is no later than  $t_{41}^{dr}$ , and thus before any DWrite() linearizes, this return value is correct. Now suppose some DWrite() operation linearizes before  $t_{41}^{dr}$ , and the last such operation uses parameter  $x^*$ . If that happens before  $t_{38}^{dr}$ , then q reads a triple  $(x^*, p, s)$  from X in line 38, where  $(p, s) \neq (\bot, \bot)$ . But when q executes line 39,  $A[q] = (\bot, \bot)$ , so q assigns ret the value  $(x^*, \text{True})$  in line 45, and thus dr later correctly returns that pair. If the first DWrite() operation linearizes in  $(t_{38}^{dr}, t_{41}^{dr})$ , then q reads  $(\bot, \bot, \bot)$  from X in line 38, and  $(x^*, p', s')$  in line 41, where  $(p', s') \neq (\bot, \bot)$ . Hence, the if-condition in line 42 evaluates to False, and dr returns correctly  $(x^*, \text{True})$ .

Now suppose dr is a DRead() by q, but it is not the first one. For ease of notation, we write  $dr_2$  instead of dr, and we let  $dr_1$  be the DRead() by q immediately preceding  $dr_2$ . Let  $(x^*, p^*, s^*)$  be the triple that q reads from X in line 38 of  $dr_2$ , and so  $ret = (x^*, g)$  is the return value of  $dr_2$ , for some  $g \in \{\text{True}, \text{False}\}$ . To prove that  $S_H$  is a valid history, we show that

(a)  $X = (x^*, \cdot, \cdot)$  at  $\ell(dr_2)$ ; and

(b)  $g = \texttt{True}$  if and only if a DWrite() linearizes between  $\ell(dr_1)$  and  $\ell(dr_2)$ .

**Proof of (a).** By definition, either  $\ell(dr_2) = t_{38}^{dr_2}$ , or  $\ell(dr_2) = t_{41}^{dr_2}$ . If  $\ell(dr_2) = t_{38}^{dr_2}$ , then (a) is immediate, because q reads  $(x^*, p^*, s^*)$  from X in that line. If  $\ell(dr_2) = t_{41}^{dr_2}$ , then according to the definition of  $\ell()$ , we have b = False at the response of  $dr_2$ . Hence, q executes line 47 of  $dr_2$ , and thus it reads the same triple  $(x^*, p^*, s^*)$  from X in line 41. It follows that  $X = (x^*, \cdot, \cdot)$  when that happens, i.e., at  $\ell(dr_2) = t_{41}^{dr_2}$ .

**Proof of (b).** First suppose g = False. This implies that line 43 is executed during  $dr_2$ , and b = False at the invocation of  $dr_2$ . Thus, by Claim 4, no process writes to X during  $[\ell(dr_1), \ell(dr_2)]$ .

Now suppose g = True. Then either line 43 of  $dr_2$  is executed and b = True at the invocation of  $dr_2$ , or line 45 of  $dr_2$  is executed. In the latter case, the if-condition in line 42 evaluates to False, and so by Claim 5 a process writes to X during  $[\ell(dr_1), \ell(dr_2)]$ . Hence, consider the case that b = True at the invocation of  $dr_2$ . Since q's local variable b does not change between consecutive DRead() method calls by q, we also have b = True at  $rsp(dr_1)$ . Hence, by Claim 1, a process writes to X during  $[\ell(dr_1), \ell(dr_2)]$ .

## D. Proof of Theorem 2

In this section, we prove Theorem 2 by showing that the implementation of the LL/SC/VL object using CAS given in Figure 3 is linearizable. Let inv(m) and rsp(m) denote the points in time of the invocation respectively response of some operation m. First we show that if a process p executes n unsuccessful consecutive CAS() operations during a LL() or a SC() operation, then at least one other process executes a successful CAS() during its SC() operation while the first process' CAS() operations fail.

**Claim 6.** Suppose a process p executes n consecutive unsuccessful CAS() operations  $c_1, \ldots, c_n$  all either in line 21 of a LL() or in line 6 of a SC(). Then during the time interval I that starts when p reads X for the last time before  $c_1$  (in line 20, respectively line 3), and ends when p finishes  $c_n$ , another process executes a successful CAS() in line 6 of a SC() operation.

*Proof.* Let  $r_i$  be the Read() operation p executes just before it executes  $c_i$  (in line 20, or line 3). Operation  $c_i$  fails if and only if a process executes a successful CAS() operation between p's  $r_i$  and  $c_i$ . As all  $c_1, \ldots, c_n$  fail, n successful CAS() operations must have happened during interval I.

Now suppose for the sake of contradiction, that none of these n successful CAS() operations during I was due to a CAS() in line 6. Hence, all n successful CAS() operations during I are due to a CAS() in line 21. Each successful CAS() operation by some process q in line 21 resets q's bit in the second component of X to 0. The second component of X has n bits, and each of these n bits can change to 0 at most once, as no CAS() in line 6 succeeds to change any of these bits to 1. Moreover, none of p's CAS() operations are successful. Hence, at most n-1 successful CAS() in line 21 can be executed during I—contradiction.

To prove Theorem 2, it suffices to prove that any history H on the implementation of the LL/SC/VL object given in Figure 3 is linearizable. For each operation m, we define the linearization point of m,  $\ell(m)$ , as follows. For an unsuccessful SC() operation sc (i.e., it returns False in any of lines 1, 5 and 8), we define  $\ell(sc) = rsp(sc)$ . A successful SC() operation sc (i.e., it returns True in line 7) linearizes at the point at which its CAS() in line 6 succeeds. For a VL() operation vl by some process p, if it returns False (in line 13), then  $\ell(vl) = rsp(vl)$ . If vl returns True (in line 11), then vl linearizes at the point at which p reads X in line 9 of vl. For a LL() operation ld by some process p, we let  $\ell(ld)$  be the point at which p executes line 14 if ld returns in either line 17, or line 25. If ld returns in line 23, then  $\ell(ld)$  is the point at which its CAS() in line 21 succeeds. It is not hard to see that the linearization point of each operation is between its invocation and its response. It only remains to show that the sequential history  $S_H$  obtained by ordering operations in H by their linearization points is valid. For that we first prove the following auxiliary claims.

**Claim 7.** Consider some LL() operation ld by some process p. Process p's bit in X is set or b = True at rsp(ld) if and only if some successful SC() operation linearizes during  $(\ell(ld), rsp(ld)]$ .

*Proof.* First we prove the if-then statement. The value of the local variable b is updated during each LL() operation, just before the operation returns (lines 16, 22 and 24). First consider the case at which b = True at rsp(ld). This case can only happen if ld returns in line 25. In this case all n CAS() operations in line 21 are unsuccessful. By Claim 6, some process executes a successful CAS() operation during a SC() operation sc, while p executes its n unsuccessful CAS(). As in this case,  $\ell(ld)$  is when p executes line 14, operation sc linearizes at its successful CAS() operation during  $(\ell(ld), rsp(ld)]$ .

Now, suppose b = False, but p's bit in X is set at rsp(ld). If ld returns in line 17, then p's bit in X is not set when p reads X in line 14 at  $\ell(ld)$ . However, by the assumption, p's bit in X is set when ld responds. This bit is only set when a CAS() operation succeeds during a SC() operation. Hence, a process executes a successful CAS() during a SC() operation (and thus its SC() linearizes) during  $(\ell(ld), rsp(ld)]$ . If operation ld returns in line 23, p's last CAS() operation in line 21 of ld must have been successful, and so p's bit in X must have changed to 0. But by the assumption, p's bit is set at rsp(ld), so some other process must have changed it back to 1 after p's successful CAS(). As  $\ell(ld)$  is when p's CAS() succeeds, the value of p's bit changes after  $\ell(ld)$  and before rsp(ld). Recall that p's bit is only set when a process executes a successful CAS() operation during a SC(). Therefore, some process must have executed a successful CAS() operation during a SC() and thus its SC() linearized during  $(\ell(ld), rsp(ld)]$ .

Now we prove the only-then statement. For that we show if p's bit in X is not set and b = False at rsp(ld), then no SC() linearizes during  $(\ell(ld), rsp(ld)]$ . Local variable b = False at rsp(ld), therefore ld returns either in line 17 or in line 23. In the case that ld returns in line 17, p's bit is 0 when p reads X in line 14 (at the linearization point of ld). Thus, p's bit is 0 at both  $\ell(ld)$  and rsp(ld). This bit can only be changed to 0 by p and only in line 21, which is not executed during ld in this case. Hence p's bit has value 0 throughout  $(\ell(ld), rsp(ld)]$ . As all processes' bits in X change to 1 when a SC() linearizes at its successful CAS() in line 6, no successful CAS() of a SC() happens throughout  $(\ell(ld), rsp(ld)]$ .

Now suppose ld returns in line 23. In this case ld linearizes when its successful CAS() changes p's bit to 0. Process p's bit is not set at rsp(ld) and as p does not try to change the value of its bit after its successful CAS(), and thus after  $\ell(ld)$ , p's bit has value 0 throughout  $(\ell(ld), rsp(ld)]$ , and so with the same argument as before, no SC() linearizes at its successful CAS() operation in this interval.

**Claim 8.** Consider a successful CAS() operation cas in line 6 of a SC() operation sc and the last Read() operation r executed before cas in line 3 of sc. Then no successful SC() linearizes between r and cas.

*Proof.* Let p be the process which executes sc, and  $(y^*, a^*)$  be the value p reads from X when it executes r. As cas succeeds, the if-condition in line 4 cannot evaluate to **True**. Hence, p's bit in X must be 0 when p executes r, and so  $a^* \neq 2^n - 1$ . Moreover, since cas is successful, the value of X is  $(y^*, a^*)$  just before cas is executed. Suppose for the sake of contradiction that at least one successful SC() operation linearizes between r and cas. Note that the value of X is updated at the linearization point of a successful SC() operation. Thus, the last successful SC() executed between r and cas must update the value of X to  $(y^*, a^*)$ . However, a successful SC() operation changes the second component of X to  $2^n - 1$ , and so  $a^* = 2^n - 1$ , a contradiction.
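The role of the second component of X in this argument can be illustrated with a small sequential experiment. The sketch below is a hedged illustration, not the code of Figure 3: `CASCell`, `all_bits_set`, and `bit_is_set` are hypothetical helpers, the starting state with all bits cleared is an assumption, and only the two facts used in the proof are modeled (a successful SC() installs the mask $2^n - 1$, and a CAS() succeeds only if X still holds the exact pair that was read).

```python
class CASCell:
    """Sequential stand-in for a single CAS object holding a pair (value, mask)."""

    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value

    def cas(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False


def all_bits_set(n):
    return (1 << n) - 1          # the mask 2^n - 1


def bit_is_set(mask, p):
    return (mask >> p) & 1 == 1


n = 3
X = CASCell(("y", 0))            # assumed starting state: all bits cleared

# Process p = 0 reads X (as in line 3) and sees its own bit cleared.
y_star, a_star = X.read()
own_bit_clear = not bit_is_set(a_star, 0)

# Another process completes a successful SC() that writes the same value "y":
# its CAS installs the mask 2^n - 1.
other_sc = X.cas(X.read(), ("y", all_bits_set(n)))

# p's CAS (as in line 6) expects the pair it read earlier and therefore fails,
# although the first component of X is unchanged.
p_cas = X.cas((y_star, a_star), ("z", all_bits_set(n)))
print(own_bit_clear, other_sc, p_cas, X.read())
```

The intervening SC() leaves the first component unchanged but not the mask, so the later CAS() cannot succeed; this is the contrapositive of the claim.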

**Claim 9.** Consider a SC() operation sc by some process p and let ld be the last LL() operation by the same process p executed before sc. Then sc is successful if and only if no successful SC() operation linearizes between ld and sc.

*Proof.* First we prove the if-then statement. Operation sc is successful, if one of its CAS() operations  $c^*$  succeeds in line 6 and so sc returns in line 7. Hence, b = False at the inv(sc). As p's local variable b is only changed during a LL() operation,

$$b = \text{False at } rsp(ld). \tag{17}$$

Moreover, since sc returns in line 7, p reads value 0 from its bit when it reads X in line 3 the last time before cas at some point t. This bit can only be reset in line 21 of a LL() operation by p, hence,

$$p\text{'s bit is 0 throughout } [rsp(ld), t]. \tag{18}$$

Hence by Eq. (17), Eq. (18), and Claim 7, no successful SC() operation linearizes during  $(\ell(ld), rsp(ld)]$ . Moreover, by Eq. (18), no successful SC() linearizes throughout [rsp(ld), t], as otherwise the value of p's bit would change to 1. Moreover, by Claim 8 no successful SC() linearizes during  $[t, \ell(sc)]$ , as  $\ell(sc)$  is when cas succeeds. Therefore, no successful SC() linearizes throughout  $(\ell(ld), \ell(sc)]$ .

Now we show the only-if statement is also true, by showing that if sc is not successful, then at least one successful SC() operation linearizes between ld and sc. There are three places where an unsuccessful sc can return. The first case is if sc returns in line 1 and so b = True at inv(sc). Process p's local variable b does not change outside a LL() operation, hence, b = True at rsp(ld). By Claim 7, a successful SC() operation linearizes during $(\ell(ld), rsp(ld)] \subseteq (\ell(ld), \ell(sc)]$.

The second case happens when sc returns in line 5. In this case, p's bit is set when p reads X in line 3 for the last time during sc at some point t. Note that  $\ell(sc) = rsp(sc) \ge t$ . Now suppose p's bit is 0 at rsp(ld). Hence, some process sets this bit with a successful CAS() at the linearization point of a successful SC() operation during  $(rsp(ld), t] \subseteq (\ell(ld), \ell(sc)]$ . If p's bit is 1 at rsp(ld), then by Claim 7, a successful SC() operation linearizes during  $(\ell(ld), rsp(ld)] \subseteq (\ell(ld), \ell(sc)]$ .

The last case is when sc returns in line 8. This implies that all n CAS() operations of p during sc failed. Thus by Claim 6, a successful CAS happens during [inv(sc), rsp(sc)] and so a successful SC() linearizes during  $(\ell(ld), \ell(sc)]$ .

**Claim 10.** Consider a VL() operation vl by some process p and let ld be the last LL() operation by the same process p executed before vl. Then vl returns True if and only if no successful SC() operation linearizes between ld and vl.

*Proof.* First we prove the if-then statement. Operation vl returns True, hence, b = False at the inv(vl). As p's local variable b is only changed during a LL() operation,

$$b = \text{False at } rsp(ld). \tag{19}$$

Moreover, since vl returns True, p reads value 0 from its bit when it reads X in line 9 of vl at the linearization point of vl. This bit can only be reset in line 21 of a LL() operation by p, hence,

$$p\text{'s bit is 0 throughout } [rsp(ld), \ell(vl)]. \tag{20}$$

Hence by Eq. (19), Eq. (20), and Claim 7, no successful SC() operation linearizes during  $(\ell(ld), rsp(ld)]$ . Moreover, by Eq. (20), no successful SC() linearizes throughout  $[rsp(ld), \ell(vl)]$ , as otherwise the value of p's bit would change to 1. Therefore, no successful SC() linearizes throughout  $(\ell(ld), \ell(vl)]$ .

Now we show the only-if statement is also true, by showing that if vl returns False, then at least one successful SC() operation linearizes between ld and vl. There are two cases that can cause vl to return False. The first case is if b = True at inv(vl). Process p's local variable b does not change outside a LL() operation, hence, b = True at rsp(ld). By Claim 7, a successful SC() operation linearizes during $(\ell(ld), rsp(ld)] \subseteq (\ell(ld), \ell(vl)]$. The second case is when p's bit is set when p reads X in line 9 of vl at some point t. Note that $\ell(vl) = rsp(vl) \ge t$. Now suppose p's bit is 0 at rsp(ld). Then some process sets this bit with a successful CAS() at the linearization point of a successful SC() operation during $(rsp(ld), t] \subseteq (\ell(ld), \ell(vl)]$. If p's bit is 1 at rsp(ld), then by Claim 7, a successful SC() operation linearizes during $(\ell(ld), rsp(ld)] \subseteq (\ell(ld), \ell(vl)]$.

Now we can quickly argue why $S_H$ is valid. Consider some LL() operation ld in $S_H$ that returns some value y. For $S_H$ to be valid, register X must contain value y at the linearization point of ld. Recall that $\ell(ld)$ is the point at which p executes line 14, if ld returns in either line 17 or line 25. Moreover, if ld returns in line 23, then $\ell(ld)$ is the point at which p's CAS() in line 21 succeeds. In the former case, y is the value that p reads from X in line 14 at $\ell(ld)$. In the latter case, y is the value that p writes into X during its successful CAS() operation in line 21 at $\ell(ld)$. Hence ld returns a valid value. This, together with Claim 9 and Claim 10, completes the proof that the resulting history $S_H$ is valid.
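The behavior established by Claims 9 and 10 can be exercised on a small sequential model. The Python sketch below is a reconstruction inferred only from the line numbers cited in the proofs, since the pseudocode itself is not reproduced in this section. The control flow, the class names `CasObject` and `LLSC`, and the initial mask $2^n - 1$ are therefore assumptions. It runs single-threaded, so it illustrates only the sequential outcomes the claims predict, not concurrent interleavings.

```python
class CasObject:
    """Sequential stand-in for a bounded writable CAS object (hypothetical)."""

    def __init__(self, value):
        self.value = value

    def read(self):
        return self.value

    def cas(self, expected, new):
        if self.value == expected:
            self.value = new
            return True
        return False


class LLSC:
    """Assumed reconstruction: X holds (value, mask), one mask bit per process."""

    def __init__(self, n, initial):
        self.n = n
        self.all_set = (1 << n) - 1
        self.X = CasObject((initial, self.all_set))  # initial mask is an assumption
        self.b = [False] * n                         # each process's local flag b

    def ll(self, p):
        x, a = self.X.read()                 # cited as line 14
        if not (a >> p) & 1:
            self.b[p] = False
            return x                         # cited as line 17
        for _ in range(self.n):
            y, a2 = self.X.read()
            # Try to reset p's own bit while keeping the value (cited as line 21).
            if self.X.cas((y, a2), (y, a2 & ~(1 << p))):
                self.b[p] = False
                return y                     # cited as line 23
        self.b[p] = True                     # n failed attempts: remember a conflict
        return x                             # cited as line 25

    def sc(self, p, v):
        if self.b[p]:
            return False                     # cited as line 1
        for _ in range(self.n):
            y, a = self.X.read()             # cited as line 3
            if (a >> p) & 1:
                return False                 # cited as line 5
            # A successful SC() sets every process's bit (cited as line 6).
            if self.X.cas((y, a), (v, self.all_set)):
                return True                  # cited as line 7
        return False                         # cited as line 8

    def vl(self, p):
        _, a = self.X.read()                 # cited as line 9
        return not self.b[p] and not (a >> p) & 1


obj = LLSC(2, "init")
first = obj.ll(0)            # process 0 links and resets its bit
own_sc = obj.sc(0, "a")      # no interfering SC(): succeeds
obj.ll(0)
obj.ll(1)
other_sc = obj.sc(1, "b")    # process 1 stores successfully
late_sc = obj.sc(0, "c")     # process 0's link is now broken: fails
late_vl = obj.vl(0)          # and VL() reports the same
fresh = obj.ll(0)            # a new LL() re-establishes the link
fresh_vl = obj.vl(0)
```

In this run, process 0's second SC() and its VL() both fail because process 1's successful SC() set all bits in the mask, while the subsequent LL() returns the value written by process 1 and makes VL() succeed again.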