poster

AArch64 Atomics: Might They Be Harming Your Performance?

Authors:

Michèle Weiland

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

Pages 419 - 421

https://doi.org/10.1145/3572848.3579838

Published: 21 February 2023

Abstract

Atomic operations are indivisible operations guaranteed to execute as a whole. One of the most important and widely used atomic operations is "compare-and-swap" (CAS), which allows threads to perform concurrent read-modify-write operations on the same memory location, free of data races. On recent Arm architectures, CAS operations can be implemented either directly via CAS instructions, or via load-linked/store-conditional (LL-SC) instruction pairs.
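As an illustration of the read-modify-write pattern described above, the following is a minimal sketch of a CAS retry loop in portable C11. The function name `cas_fetch_add` is hypothetical and does not come from the paper. Which instructions the compiler emits for the compare-exchange (a CAS instruction or an LL-SC pair) depends on the target and the compiler flags, so the source code is the same for both approaches.

```c
#include <stdatomic.h>

/* Hypothetical helper (not from the paper): atomically add `delta` to *addr
 * using only compare-and-swap. Returns the value seen just before the
 * successful update. */
long cas_fetch_add(_Atomic long *addr, long delta)
{
    long expected = atomic_load_explicit(addr, memory_order_relaxed);

    /* On failure, the compare-exchange stores the value it actually found
     * into `expected`, so each retry works with fresh data. The weak variant
     * may also fail spuriously, which the loop absorbs. */
    while (!atomic_compare_exchange_weak_explicit(
               addr, &expected, expected + delta,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* another thread updated *addr first; try again */
    }
    return expected;
}
```

Each failed iteration is work that produces no update, which is why the abstract reports time per successful CAS operation rather than time per attempt.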

In this work we explore the performance of the CAS and LL-SC approaches to implementing CAS operations on recent high-performance AArch64 CPUs, namely the A64FX, ThunderX2 (TX2), and Graviton3. We observe that these instructions can lead to fundamentally different performance profiles. On A64FX, for example, the newer CAS instructions (often preferred by compilers over the older LL-SC pairs) can lead to a quadratic increase in average time per successful CAS operation as the number of threads increases, whereas the older LL-SC pairs show the expected linear increase. For high thread counts, this translates into LL-SC being more than 20x faster than CAS. On TX2 and Graviton3, LL-SC can bring more conservative (but still significant) 2-3x speedups. We characterise the conditions under which each approach delivers better performance on each CPU.
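The metric used above, average time per successful CAS operation as a function of thread count, can be approximated with a small harness. The sketch below is an assumption-laden illustration, not the authors' benchmark: the names `avg_ns_per_successful_cas`, `shared_counter`, and `start_flag` are hypothetical, timing uses wall-clock time from C11 `timespec_get`, and no thread pinning or warm-up is done.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

static _Atomic long shared_counter;  /* the single contended location */
static atomic_bool start_flag;       /* releases all workers together */

static void *worker(void *arg)
{
    long ops = *(long *)arg;
    while (!atomic_load_explicit(&start_flag, memory_order_acquire))
        ;  /* spin until the start signal */
    for (long i = 0; i < ops; i++) {
        long seen = atomic_load_explicit(&shared_counter, memory_order_relaxed);
        /* retry until this thread's CAS succeeds */
        while (!atomic_compare_exchange_weak_explicit(
                   &shared_counter, &seen, seen + 1,
                   memory_order_relaxed, memory_order_relaxed))
            ;
    }
    return NULL;
}

/* Hypothetical harness: wall-clock ns per successful CAS, or -1.0 on error. */
double avg_ns_per_successful_cas(int nthreads, long ops_per_thread)
{
    pthread_t *tids = malloc((size_t)nthreads * sizeof *tids);
    if (!tids)
        return -1.0;
    atomic_store(&shared_counter, 0);
    atomic_store(&start_flag, false);

    int started = 0;
    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, worker, &ops_per_thread) != 0)
            break;

    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    atomic_store_explicit(&start_flag, true, memory_order_release);
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    timespec_get(&t1, TIME_UTC);
    free(tids);
    if (started != nthreads)
        return -1.0;

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    /* every loop iteration ends in exactly one successful CAS */
    return ns / ((double)nthreads * (double)ops_per_thread);
}
```

Calling the function for increasing thread counts and plotting the results would show whether the cost grows roughly linearly or faster. To compare the two approaches on AArCh64 hardware, the same source would be built twice. With GCC or Clang, `-march=armv8.1-a` is generally expected to select the CAS instructions, while `-march=armv8-a -mno-outline-atomics` is generally expected to produce LL-SC loops; inspecting the generated assembly is the reliable way to confirm which was emitted.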



Published In

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

February 2023

480 pages

ISBN:9798400700156

DOI:10.1145/3572848

General Chair:
Maryam Mehri Dehnavi
University of Toronto
,
Program Chairs:
Milind Kulkarni
Purdue University
,
Sriram Krishnamoorthy
Google

Copyright © 2023 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PPoPP '23

PPoPP '23: The 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

February 25 - March 1, 2023

Montreal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%
