Verifying and Optimizing the HMCS Lock for Arm Servers

Oberhauser, Jonas; Oberhauser, Lilith; Paolillo, Antonio; Behrens, Diogo; Fu, Ming; Vafeiadis, Viktor

doi:10.1007/978-3-030-91014-3_17

Conference paper
First Online: 02 December 2021

Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 12754)

Abstract

To optimize the performance of some of our systems running on non-uniform memory architecture (NUMA) servers with Arm processors, we have implemented multiple versions of the HMCS lock, an advanced NUMA-aware lock that has been identified in the literature as particularly scalable.

This is a highly non-trivial task because of the many implementation choices for interlocked operations, alignment, and memory barrier placement, which affect not only the lock’s performance but also its correctness. The published description of the HMCS lock does not discuss choices that affect performance, but it does present a choice of barriers. We observe that this choice is wrong, leading to hangs on Kunpeng Arm servers. We repair the barriers and implement the first formally verified HMCS lock with VSync, an automated formal verification and optimization tool for weak consistency. We explain the barrier bugs in detail and report our experience of barrier optimizations for Arm servers.

References

1. https://e.huawei.com/uk/products/servers/taishan-server/taishan-2280-v2
2. https://en.wikichip.org/wiki/hisilicon/kunpeng/920-6426
3. https://openeuler.org
4. Amazon Web Services: AWS Graviton Processor - Enabling the best price performance in Amazon EC2 (2020). https://aws.amazon.com/ec2/graviton
5. Chabbi, M., Amer, A., Wen, S., Liu, X.: An efficient abortable-locking protocol for multi-level NUMA systems. SIGPLAN Not. 52(8), 61–74 (2017). https://doi.org/10.1145/3155284.3018768
6. Chabbi, M., Fagan, M.W., Mellor-Crummey, J.M.: High performance locks for multi-level NUMA systems. In: PPoPP 2015, New York, USA, pp. 215–226. ACM (2015). https://doi.org/10.1145/2688500.2688503
7. Dean, J., Ghemawat, S.: Leveldb (2021). https://github.com/google/leveldb
8. Defilippi, J.: Introducing AMBA 5 CHI protocol enhancements (2017). https://community.arm.com/developer/ip-products/system/b/soc-design-blog/posts/introducing-new-amba-5-chi-protocol-enhancements
9. Dice, D., Kogan, A.: Compact NUMA-aware locks. In: EuroSys 2019, New York, USA. ACM (2019). https://doi.org/10.1145/3302424.3303984
10. Guiroux, H., Lachaize, R., Quéma, V.: Multicore locks: the case is not closed yet. In: USENIX Annual Technical Conference, pp. 649–662 (2016)
11. Huawei: Huawei unveils industry’s highest-performance ARM-based CPU, January 2019. https://www.huawei.com/en/news/2019/1/huawei-unveils-highest-performance-arm-based-cpu
12. Kokologiannakis, M., Raad, A., Vafeiadis, V.: Model checking for weakly consistent libraries. In: PLDI 2019, New York, USA, pp. 96–110. ACM (2019). https://doi.org/10.1145/3314221.3314609
13. Liu, N., Zang, B., Chen, H.: No barrier in the road: a comprehensive study and optimization of ARM barriers. In: PPoPP 2020, New York, USA, pp. 348–361. ACM (2020). https://doi.org/10.1145/3332466.3374535
14. Mellor-Crummey, J.M., Scott, M.L.: Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9(1), 21–65 (1991). https://doi.org/10.1145/103727.103729
15. Oberhauser, J., et al.: VSync: push-button verification and optimization for synchronization primitives on weak memory models. In: ASPLOS 2021, New York, USA. ACM (2021). https://doi.org/10.1145/3445814.3446748
16. Podkopaev, A., Lahav, O., Vafeiadis, V.: Bridging the gap between programming languages and hardware weak memory models. Proc. ACM Program. Lang. 3(POPL) (2019). https://doi.org/10.1145/3290382
17. Pulte, C., Flur, S., Deacon, W., French, J., Sarkar, S., Sewell, P.: Simplifying ARM concurrency: multicopy-atomic axiomatic and operational models for ARMv8. Proc. ACM Program. Lang. 2(POPL) (2017). https://doi.org/10.1145/3158107

Author information

Authors and Affiliations

Huawei Dresden Research Center, 01067, Dresden, Germany
Jonas Oberhauser, Lilith Oberhauser, Antonio Paolillo, Diogo Behrens & Ming Fu
Huawei OS Kernel Lab, Shenzhen, China
Jonas Oberhauser, Lilith Oberhauser, Antonio Paolillo, Diogo Behrens & Ming Fu
Max Planck Institute for Software Systems, 67663, Kaiserslautern, Germany
Viktor Vafeiadis

Corresponding author

Correspondence to Jonas Oberhauser.

Editor information

Editors and Affiliations

Mohammed VI Polytechnic University, Ben Guerir, Morocco
Karima Echihabi
Technische Universität Braunschweig, Braunschweig, Niedersachsen, Germany
Roland Meyer

Appendices

A Arm vs. IMM Consistency Model

A standard way to define weak consistency models is through execution graphs. Nodes in these graphs represent events such as reads and writes, and edges specify various relations between these events, e.g., the order in which reads and writes to the same location are committed. Memory models are defined by (a) the edges that exist in the graph and (b) restrictions on these edges. For brevity, we introduce only the event and edge types of Arm and IMM that are relevant to the bugs we mention in this paper. We consider write events \(\mathbf{W}_{X}(loc, val)\), read events \(\mathbf{R}_{X}(loc, val)\), and fence events \(\mathbf{F}_{X}\), where \(X\) is the so-called mode of the event, \(loc\) is the shared memory location on which the event operates, and \(val\) is the value written or read by the event. The mode denotes the type of memory barrier (if any) represented by the event: \(\mathbf{rlx}\) indicates that no barrier is present, \(\mathbf{acq}\) represents an acquire barrier, \(\mathbf{rel}\) a release barrier, and \(\mathbf{sc}\) a sequentially consistent barrier. Mode \(\mathbf{rlx}\) is the default and is omitted.

We consider the following types of fundamental edges:

\(\mathsf{rfe}\) (read from external) edges connect a write event of a thread to a read event of another thread that reads from it.
\(\mathsf{moe}\) (modification order external) edges connect write events (writing to the same location) of different threads, indicating the order in which they were committed.
\(\mathsf{po}\) (program order) edges connect events of the same thread in the order in which they are issued by the program.
\(\mathsf{ctrl}\) (control dependency) edges connect a read \(\mathbf{R}_{X}(x, a)\) that influences the evaluation of a condition (e.g., an if- or while-condition) to every event of the same thread that is issued after the condition.
event-type self-loops \([E]\) for event type \(E\) connect every event \(e\) of type \(E\) to itself.
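
As an illustration of the definitions above, the following minimal sketch represents events and derives program-order, external reads-from, and self-loop edges as sets of pairs. The `Event` fields, the helper names, and the message-passing execution are hypothetical and chosen for this example only; they are not taken from the figures of this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str          # label of the node in the execution graph
    thread: int        # thread that issues the event
    index: int         # position of the event in its thread's program
    kind: str          # "W" (write), "R" (read) or "F" (fence)
    loc: str = ""      # shared location; empty for fences
    val: int = 0       # value written or read
    mode: str = "rlx"  # "rlx", "acq", "rel" or "sc"


def program_order(events):
    """po: pairs of events of the same thread, earlier event first."""
    return {(x, y) for x in events for y in events
            if x.thread == y.thread and x.index < y.index}


def reads_from_external(rf):
    """rfe: reads-from pairs whose write and read are on different threads."""
    return {(w, r) for (w, r) in rf if w.thread != r.thread}


def self_loops(events, kind):
    """[E]: every event of the given kind related to itself."""
    return {(e, e) for e in events if e.kind == kind}


# Hypothetical message-passing execution with two threads.
w_data = Event("a", 0, 0, "W", "data", 1)
w_flag = Event("b", 0, 1, "W", "flag", 1, "rel")
r_flag = Event("c", 1, 0, "R", "flag", 1, "acq")
r_data = Event("d", 1, 1, "R", "data", 1)
events = [w_data, w_flag, r_flag, r_data]
rf = {(w_flag, r_flag), (w_data, r_data)}

po = program_order(events)
rfe = reads_from_external(rf)
```

Representing every edge type as a plain set of pairs keeps the derived rules of a consistency model expressible as ordinary set operations.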

Other edges are derived from these fundamental edges according to the rules of the consistency model (Figs. 15 and 16). For instance, the edge in Fig. 3(b) implies an edge on IMM (with Eq. (16)). Such derived rules are often defined with the composition operator ‘;’, which for arbitrary edge types R and S is defined by

\[ R \mathbin{;} S \;=\; \{\, (a, c) \mid \exists b.\ (a, b) \in R \wedge (b, c) \in S \,\} \]
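
The composition operator is ordinary relational composition and can be sketched over edge sets as follows. The function name `compose` and the example edges are assumptions made for illustration only.

```python
def compose(r, s):
    """Return r ; s, i.e. all (a, c) with (a, b) in r and (b, c) in s for some b."""
    successors = {}
    for b, c in s:
        successors.setdefault(b, set()).add(c)
    return {(a, c) for a, b in r for c in successors.get(b, ())}


# Hypothetical edges: a is before b in program order, and c reads from b.
po = {("a", "b")}
rfe = {("b", "c")}
po_then_rfe = compose(po, rfe)
```

Here `po_then_rfe` relates `a` to `c`, whereas composing in the other order yields no edge because no `rfe` edge ends where a `po` edge starts.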

The meaning of barriers is defined by the derived edges they imply; for example, the meaning of \({\mathbf {F_{\mathbf {rel}}}}\) (which maps to the full DMB.ISH fence) on Arm is defined through the edge it implies between preceding and subsequent operations (with Eqs. (3) to (5)).

In Figs. 15 and 16 we have collected the rules of IMM and Arm consistency that are relevant to our discussion. In [16] it is shown that Arm consistency implies IMM consistency; thus any bug on Arm is also present on IMM, and verification on IMM implies correctness on Arm. The converse is not true: bugs on IMM are not always bugs on Arm. Indeed, some of the bugs identified by VSync on the HMCS lock on IMM are not bugs on Arm. The key difference relevant to these bugs is that \(\mathsf{moe}\) edges imply an \(\mathsf{ob}\) edge on Arm, but do not imply an \(\mathsf{ar}\) edge on IMM. Thus they contribute to \(\mathsf{ob}\) cycles but not to \(\mathsf{ar}\) cycles.

We illustrate these implications using the execution graphs in Fig. 3. In Fig. 3(a), we have an \(\mathsf{rfe}\) edge from Event e to Event a; in Fig. 3(b), we instead have a \(\mathsf{moe}\) edge from Event e to Event a. Other than those events and the edge between them, the graphs are the same. Thus in both graphs, the following imply \(\mathsf{ob}\) edges:

(Eqs. (3) to (5))
(Eqs. (1) and (5))
(Eqs. (2), (4) and (5))

The only edge missing for an \(\mathsf{ob}\) cycle is an \(\mathsf{ob}\) edge from Event e to Event a. This edge is implied by the \(\mathsf{rfe}\) edge in Fig. 3(a) and the \(\mathsf{moe}\) edge in Fig. 3(b) (with Eqs. (1) and (5)). Note that due to transitivity (Eq. (5)) the cycle implies a reflexive \(\mathsf{ob}\) edge, which contradicts the irreflexivity of \(\mathsf{ob}\) (Eq. (6)). Thus both graphs are inconsistent on Arm.
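
The step from a cycle to a reflexive edge can be checked mechanically. The sketch below closes a set of edges under transitivity and tests irreflexivity; the five-event cycle is a hypothetical stand-in for the graphs of Fig. 3, and the function names are assumptions.

```python
def transitive_closure(edges):
    """Smallest transitive relation containing the given edges."""
    closure = set(edges)
    while True:
        derived = {(a, d) for a, b in closure for c, d in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived


def is_irreflexive(edges):
    """True if no event is related to itself."""
    return all(a != b for a, b in edges)


# Hypothetical ordering edges a -> b -> c -> d -> e, closed by e -> a.
chain = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
closing_edge = ("e", "a")

with_closing_edge = transitive_closure(chain | {closing_edge})
without_closing_edge = transitive_closure(chain)
```

With the closing edge, the closure contains an edge from `a` to itself and the irreflexivity check fails; without it, the closure stays irreflexive.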

On IMM, the following imply \(\mathsf{ar}\) edges:

and (Eqs. (7), (10) and (11))
(Eq. (11))
(Eqs. (8), (9) and (11))

Analogous to before, only an \(\mathsf{ar}\) edge from Event e to Event a is missing for an \(\mathsf{ar}\) cycle. In Fig. 3(a) this edge is implied by the \(\mathsf{rfe}\) edge with Eq. (11), and this graph is inconsistent on IMM. But in Fig. 3(b), the \(\mathsf{moe}\) edge does not contribute an \(\mathsf{ar}\) edge. Indeed, there is no \(\mathsf{ar}\) cycle in Fig. 3(b), which is consistent on IMM. Unfortunately, two of the bugs detected by VSync on IMM appear only in graphs that look like Fig. 3(b). These bugs therefore appear only on IMM and cannot appear on Arm.
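
The difference between the two models can be illustrated with a deliberately reduced toy check. It assumes that the only distinction is whether the closing edge from Event e to Event a counts toward the cycle: in the Arm-like model both external reads-from and external modification-order edges count, in the IMM-like model only external reads-from edges do. The names `has_cycle` and `ordering_edges` are hypothetical, and the sketch ignores every other rule of Figs. 15 and 16, including the additional IMM check discussed next.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as edge pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    unvisited, in_progress, done = 0, 1, 2
    state = dict.fromkeys(graph, unvisited)

    def visit(node):
        state[node] = in_progress
        for nxt in graph[node]:
            if state[nxt] == in_progress:
                return True  # back edge closes a cycle
            if state[nxt] == unvisited and visit(nxt):
                return True
        state[node] = done
        return False

    return any(state[n] == unvisited and visit(n) for n in graph)


# Edges ordered by both toy models (hypothetical): a -> b -> c -> d -> e.
COMMON = {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")}
CLOSING_EDGE = ("e", "a")

# Assumption: which kinds of closing edge each toy model counts as ordering.
COUNTED = {"arm": {"rfe", "moe"}, "imm": {"rfe"}}


def ordering_edges(model, closing_kind):
    """Edges considered by the cycle check of the given toy model."""
    extra = {CLOSING_EDGE} if closing_kind in COUNTED[model] else set()
    return COMMON | extra
```

Under these assumptions, a graph closed by a reads-from edge is cyclic in both toy models, while a graph closed by a modification-order edge is cyclic only in the Arm-like one.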

We proceed to discuss how to fix these bugs on IMM. Consider the third graph (see Fig. 3(c)) which is almost identical to the second (see Fig. 3(b)). We only added a \({\mathbf {F_{\mathbf {acq}}}}\) fence between the and . Adding this fence does not eliminate the -cycle we inferred previously, and this graph is also inconsistent with Arm. On IMM we derive the following edges:

thus (Eqs. (13) and (14))
, and thus (Eq. (15))
(Eq. (16))

As shown in (Fig. 3(d)) we end up with and thus . But the ; relation is irreflexive (Eq. (17)). We conclude that this graph is inconsistent with IMM. In other words, due to the \({\mathbf {F_{\mathbf {acq}}}}\) fence the execution with the bug cannot occur on IMM.

B Optimizing Barriers on Atomic Operations

The implicit \(\mathbf {sc}\) barriers on CAS and SWAP in Fig. 2 are not optimal. VSync reports that they are already too strong for IMM, and indeed they can be optimized further for Arm. The exact optimization depends on the variant. Manual analysis shows that when using fences, all barriers on the atomic operations can be removed. When using implicit barriers, release barriers on Lines 17 and 30 are needed to avoid non-termination (with similar bugs as those in Sect. 3.1), and an acquire barrier on Line 17 and a release barrier on Line 55 are needed to ensure that operations in the critical section cannot leak out of the lock (resulting in loss of mutual exclusion). The resulting barriers are shown in Table 1. That table also shows a variant that may be preferable when using interlocked LSE instructions. Unlike load/store-exclusive pairs, on which \(\mathbf {sc}\) implicit barriers do not act like a full barrier (see discussion in Sect. 3.1), LSE interlocked operations have been strengthened in a recent change to Arm specifications to provide the same semantics for \(\mathbf {sc}\) implicit barriers as a DMB.ISH (see Eq. (10)) through the rule

where \( amo \) relates the read event of an atomic memory operation (such as SWP) to its write event, and \([\mathbf{A} ]\) and \([\mathbf{L} ]\) are event-type self-loops for acquire and release events, respectively. This contrasts with the earlier definition in [17], in which LSE instructions provide the same ordering guarantees as load/store-exclusive pairs.

However, this stronger ordering is not necessary for the HMCS lock, and thus we optimize barriers further by relegating the acquire barrier to a trailing fence. This variant is what is denoted by hmcs-amo in Sect. 5. As demonstrated in Fig. 10, this optimization does not currently improve performance compared to hmcs-vsync (which uses \(\mathbf {sc}\) barriers on atomic operations). Perhaps if LSE operations become more efficient for low-contention cases in the future, these optimizations will become more interesting.

Table 1. Possible optimizations on Arm for atomic operations when using fences or implicit barriers.

For the sake of completeness we also implement a variant, hmcs-armamo, which applies the optimization to hmcs-arm, i.e., in which all implicit barriers on atomic operations are removed, as described in Table 1. Performance results (without LSE) are shown in Figs. 17 and 18. While minor improvements can be measured in the microbenchmark, these improvements also do not translate to the larger benchmark.

About this paper

Cite this paper

Oberhauser, J., Oberhauser, L., Paolillo, A., Behrens, D., Fu, M., Vafeiadis, V. (2021). Verifying and Optimizing the HMCS Lock for Arm Servers. In: Echihabi, K., Meyer, R. (eds) Networked Systems. NETYS 2021. Lecture Notes in Computer Science, vol 12754. Springer, Cham. https://doi.org/10.1007/978-3-030-91014-3_17

DOI: https://doi.org/10.1007/978-3-030-91014-3_17
Published: 02 December 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91013-6
Online ISBN: 978-3-030-91014-3
eBook Packages: Computer Science, Computer Science (R0)
