ABSTRACT
Atomic operations are indivisible operations guaranteed to execute as a whole. One of the most important and widely used atomic operations is "compare-and-swap" (CAS), which allows threads to perform concurrent read-modify-write operations on the same memory location, free of data races. On recent Arm architectures, CAS operations can be implemented either directly via CAS instructions, or via load-linked/store-conditional (LL-SC) instruction pairs.
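As a minimal sketch of the read-modify-write pattern described above, the C++ fragment below builds an atomic add out of a CAS retry loop using `std::atomic::compare_exchange_weak`. The helper names `cas_add` and `run_cas_counter` are hypothetical and chosen only for illustration; this is not the benchmark code used in the work.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: atomically add `delta` to `target` with a CAS retry loop.
// Returns the number of failed CAS attempts before the update succeeded.
inline std::uint64_t cas_add(std::atomic<std::uint64_t>& target, std::uint64_t delta) {
    std::uint64_t retries = 0;
    std::uint64_t expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak writes the value it observed into
    // `expected`, so the next attempt is computed from fresh data.
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        ++retries;
    }
    return retries;
}

// Hypothetical driver: `num_threads` threads each perform `ops_per_thread`
// CAS-based increments on one shared counter; returns the final counter value.
inline std::uint64_t run_cas_counter(unsigned num_threads, std::uint64_t ops_per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, ops_per_thread] {
            for (std::uint64_t i = 0; i < ops_per_thread; ++i) {
                cas_add(counter, 1);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

Calling `run_cas_counter(4, 10000)` should return 40000, since every increment is eventually applied exactly once. Which instructions the loop compiles to depends on the toolchain and target flags. With GCC or Clang, targeting `-march=armv8.1-a` typically emits a CAS-family instruction, while `-march=armv8-a -mno-outline-atomics` typically emits an exclusive load/store (LL-SC) loop. Inspect the generated assembly rather than relying on this.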
In this work we explore the performance of the CAS and LL-SC approaches to implementing CAS operations on recent high-performance AArch64 CPUs, namely the A64FX, ThunderX2 (TX2), and Graviton3. We observe that the two approaches can lead to fundamentally different performance profiles. On A64FX, for example, the newer CAS instructions (often preferred by compilers over the older LL-SC pairs) can lead to a quadratic increase in average time per successful CAS operation as the number of threads increases, whereas the LL-SC pairs show the expected linear increase. For high thread counts, this translates into LL-SC being more than 20x faster than CAS. On TX2 and Graviton3, LL-SC brings more modest (but still significant) speedups of 2x to 3x. We characterise the conditions under which each approach delivers better performance on each CPU.
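One way to estimate the metric discussed above is sketched below: all threads contend on a single memory location, and total wall-clock time is divided by the total number of successful CAS operations. The function name `avg_ns_per_successful_cas` and this exact definition of the metric are assumptions made for illustration; the measurement methodology of the work itself may differ.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical micro-benchmark: every thread performs `ops_per_thread`
// successful CAS increments on the same location. Returns wall-clock
// nanoseconds divided by the total number of successful CAS operations.
inline double avg_ns_per_successful_cas(unsigned num_threads, std::uint64_t ops_per_thread) {
    std::atomic<std::uint64_t> shared{0};
    std::atomic<bool> go{false};
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&shared, &go, ops_per_thread] {
            // Wait for the start signal so all threads contend at the same time.
            while (!go.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            for (std::uint64_t i = 0; i < ops_per_thread; ++i) {
                std::uint64_t expected = shared.load(std::memory_order_relaxed);
                // Retry until this thread's CAS succeeds; failures refresh `expected`.
                while (!shared.compare_exchange_strong(expected, expected + 1,
                                                       std::memory_order_acq_rel,
                                                       std::memory_order_relaxed)) {
                }
            }
        });
    }

    const auto start = std::chrono::steady_clock::now();
    go.store(true, std::memory_order_release);
    for (auto& w : workers) {
        w.join();
    }
    const auto stop = std::chrono::steady_clock::now();

    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    const double successes =
        static_cast<double>(num_threads) * static_cast<double>(ops_per_thread);
    return total_ns / successes;
}
```

Running this for increasing values of `num_threads` and plotting the result would show how the cost per successful CAS grows with contention on a given machine. The sketch does not pin threads to cores or discard warm-up iterations, both of which a careful measurement would likely need.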
Index Terms
- AArch64 Atomics: Might They Be Harming Your Performance?