Modeling the Performance of Atomic Primitives on Modern Architectures

Authors:

Fazeleh Hoseini,

Philippas Tsigas

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing

Article No.: 28, Pages 1 - 11

https://doi.org/10.1145/3337821.3337901

Published: 05 August 2019

Abstract

Atomic primitives, which let a processor access a memory location atomically, are key to the correctness and feasibility of parallel software systems. Their performance plays a significant role in the scalability and overall performance of such systems.

In this work, we study the performance of atomic primitives, in terms of latency, throughput, fairness, and energy consumption, in two common software execution settings: one that results in high-contention access to shared memory and one that results in low-contention access. We present an exhaustive study of the performance of atomics in these two application contexts and propose a performance model that captures their behavior. We consider two state-of-the-art architectures: Intel Xeon E5 and Intel Xeon Phi (KNL). The model is centered around the bouncing of cache lines between threads that execute atomic primitives on these shared cache lines. It is simple enough to be used in practice, captures the behavior of atomics accurately under these execution scenarios, and facilitates algorithmic design decisions in multi-threaded programming.
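To make the notion of cache-line bouncing concrete, the following is a toy sketch and not the authors' model, whose equations the abstract does not give. It counts how often a cache line changes hands under an assumed round-robin interleaving of threads. The names `count_bounces` and `round_robin` and the interleaving itself are hypothetical, chosen only for illustration.

```python
def count_bounces(accesses):
    """Count ownership transfers of cache lines.

    `accesses` is an iterable of (thread_id, line_id) pairs. An access counts
    as a bounce when the line was last held by a different thread.
    """
    owner = {}  # line_id -> thread that last held the line
    bounces = 0
    for thread, line in accesses:
        previous = owner.get(line)
        if previous is not None and previous != thread:
            bounces += 1
        owner[line] = thread
    return bounces


def round_robin(n_threads, ops_per_thread, shared):
    """Assumed interleaving: threads take turns, one atomic operation each.

    With shared=True every thread targets line 0 (high contention);
    with shared=False each thread targets its own line (low contention).
    """
    for _ in range(ops_per_thread):
        for t in range(n_threads):
            yield (t, 0 if shared else t)


high = count_bounces(round_robin(4, 100, shared=True))
low = count_bounces(round_robin(4, 100, shared=False))
print(high, low)  # prints "399 0"
```

Under this assumed interleaving, every operation after the first moves the shared line to a different thread, while private lines never move. Real hardware interleavings, coherence protocols, and the latency of each transfer are outside the scope of this sketch.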

Index Terms

Modeling the Performance of Atomic Primitives on Modern Architectures
1. General and reference
  1. Cross-computing tools and techniques
    1. Estimation
    2. Performance
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models

Published In

ICPP '19: Proceedings of the 48th International Conference on Parallel Processing

August 2019

1107 pages

ISBN: 9781450362955

DOI: 10.1145/3337821

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Vetenskapsrådet
Stiftelsen för Strategisk Forskning

Conference

ICPP 2019: 48th International Conference on Parallel Processing

August 5 - 8, 2019

Kyoto, Japan

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Article Metrics

Total Citations: 6
Total Downloads: 194
Downloads (Last 12 months): 32
Downloads (Last 6 weeks): 3

Reflects downloads up to 17 Feb 2025

Cited By

Togkousidis A, Chernomor O, Stamatakis A. (2023). Parallel Inference of Phylogenetic Stands with Gentrius. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 139-148. Online publication date: May-2023.
https://doi.org/10.1109/IPDPSW59300.2023.00035

Gurumurthy B, Broneske D, Schäler M, Pionteck T, Saake G. (2023). Novel insights on atomic synchronization for sort-based group-by on GPUs. Distributed and Parallel Databases 41:3, 387-409. Online publication date: 24-Apr-2023.
https://doi.org/10.1007/s10619-023-07424-2

Rukundo A, Atalar A, Tsigas P, Agrawal K, Lee I. (2022). Performance Analysis and Modelling of Concurrent Multi-access Data Structures. Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures, 333-344. Online publication date: 11-Jul-2022.
https://dl.acm.org/doi/10.1145/3490148.3538578

Gurumurthy B, Broneske D, Schaler M, Pionteck T, Saake G. (2021). An Investigation of Atomic Synchronization for Sort-Based Group-By Aggregation on GPUs. 2021 IEEE 37th International Conference on Data Engineering Workshops (ICDEW), 48-53. Online publication date: Apr-2021.
https://doi.org/10.1109/ICDEW53142.2021.00016

Williams B, Leidel J, Wang X, Donofrio D, Chen Y. (2021). CircusTent: A Tool for Measuring the Performance of Atomic Memory Operations on Emerging Architectures. OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Exascale and Smart Networks, 92-110. Online publication date: 13-Sep-2021.
https://dl.acm.org/doi/10.1007/978-3-031-04888-3_6

Williams B, Leidel J, Wang X, Donofrio D, Chen Y. (2020). CircusTent: A Benchmark Suite for Atomic Memory Operations. Proceedings of the International Symposium on Memory Systems, 144-157. Online publication date: 28-Sep-2020.
https://dl.acm.org/doi/10.1145/3422575.3422789
