DOI: 10.1145/3572848.3579838
poster

AArch64 Atomics: Might They Be Harming Your Performance?

Published: 21 February 2023

ABSTRACT

Atomic operations are indivisible operations guaranteed to execute as a whole. One of the most important and widely used atomic operations is "compare-and-swap" (CAS), which allows threads to perform concurrent read-modify-write operations on the same memory location, free of data races. On recent Arm architectures, CAS operations can be implemented either directly via CAS instructions, or via load-linked/store-conditional (LL-SC) instruction pairs.
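As a minimal sketch of the read-modify-write pattern described above, the following portable C++ increments a shared counter with a CAS retry loop. The helper names `cas_increment` and `run_cas_counter` are hypothetical and do not come from the paper. On AArch64, a compiler may lower `compare_exchange_weak` either to a CAS instruction or to an LL-SC pair, depending on the target architecture level and compiler options.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: add 1 to `counter` using a CAS retry loop.
// Returns how many CAS attempts failed before one succeeded.
inline unsigned cas_increment(std::atomic<long>& counter) {
    unsigned failures = 0;
    long expected = counter.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail spuriously, so it is always retried.
    // On failure it reloads `expected` with the value currently stored.
    while (!counter.compare_exchange_weak(expected, expected + 1,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        ++failures;
    }
    return failures;
}

// Hypothetical driver: every thread performs the same number of increments
// on one shared location; the final value is returned.
inline long run_cas_counter(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                cas_increment(counter);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

Because every increment is applied through a successful CAS, no update is lost: `run_cas_counter(4, 10000)` returns 40000 regardless of how the threads interleave.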

In this work we explore the performance of the CAS and LL-SC approaches to implement CAS operations on recent high-performance AArch64 CPUs, namely the A64FX, ThunderX2 (TX2), and Graviton3. We observe that these instructions can lead to fundamentally different performance profiles. On A64FX, for example, the newer CAS instructions (often preferred by compilers over the older LL-SC pairs) can lead to a quadratic increase in average time per successful CAS operation as the number of threads increases, whereas the older LL-SC pairs show the expected linear increase. For high thread counts, this translates into LL-SC being more than 20x faster than CAS. On TX2 and Graviton3, LL-SC can bring more conservative (but still significant) 2-3x speedups. We characterise the conditions under which each approach delivers better performance on each CPU.
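The metric named above, average time per successful CAS operation as the thread count grows, could be approximated with a sketch like the one below. This is not the authors' benchmark: the struct `CasTiming`, the function `time_contended_cas`, and the choice to divide wall-clock time by the total number of successes are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical result record for one run.
struct CasTiming {
    long successes;         // successful CAS operations across all threads
    long failures;          // failed CAS attempts across all threads
    double ns_per_success;  // wall-clock nanoseconds / successes (assumed metric)
};

// Hypothetical benchmark: all threads contend on a single location until
// each has completed `ops_per_thread` successful CAS operations.
inline CasTiming time_contended_cas(int num_threads, long ops_per_thread) {
    assert(num_threads > 0 && ops_per_thread > 0);
    std::atomic<long> shared{0};
    std::atomic<long> failures{0};
    std::atomic<bool> go{false};

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // Simple start barrier so threads begin contending together.
            while (!go.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            long local_failures = 0;
            for (long i = 0; i < ops_per_thread; ++i) {
                long expected = shared.load(std::memory_order_relaxed);
                while (!shared.compare_exchange_strong(
                           expected, expected + 1,
                           std::memory_order_acq_rel,
                           std::memory_order_relaxed)) {
                    ++local_failures;  // `expected` was refreshed; retry
                }
            }
            failures.fetch_add(local_failures, std::memory_order_relaxed);
        });
    }

    const auto start = std::chrono::steady_clock::now();
    go.store(true, std::memory_order_release);
    for (auto& w : workers) {
        w.join();
    }
    const auto stop = std::chrono::steady_clock::now();

    const double elapsed_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    const long successes = shared.load();
    return CasTiming{successes, failures.load(), elapsed_ns / successes};
}
```

Calling `time_contended_cas(n, ops)` for increasing `n` and plotting `ns_per_success` would show how the cost scales on a given machine. To compare the two approaches on AArch64 with GCC, one option is to build the same source for a baseline target (for example `-march=armv8-a -mno-outline-atomics`, which is expected to yield LL-SC loops) and for a target with the Large System Extensions (for example `-march=armv8.1-a`, which is expected to yield CAS instructions), then confirm the generated instructions in the disassembly.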

        • Published in

          PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
          February 2023
          480 pages
          ISBN:9798400700156
          DOI:10.1145/3572848

          Copyright © 2023 Owner/Author

          Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 230 of 1,014 submissions, 23%