CircusTent: A Tool for Measuring the Performance of Atomic Memory Operations on Emerging Architectures

Williams, Brody; Leidel, John D.; Wang, Xi; Donofrio, David; Chen, Yong

doi:10.1007/978-3-031-04888-3_6

Conference paper
First Online: 20 May 2022

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13159)

Abstract

Endeavors to engineer the next generation of exascale platforms have resulted in a fundamental shift in system architectures. Contrary to what was once considered conventional wisdom, high performance systems designed today are characterized by heterogeneous architectures wherein distinct components are carefully combined in order to optimize system performance and energy efficiency. One unintended consequence of this new paradigm is an increasingly complex memory hierarchy that frequently spans multiple devices and may be composed of disparate memory types. Unfortunately, the effect of this new memory model on performance is not well understood. Moreover, a quantifiable, system-agnostic methodology capable of assessing the performance of the diverse memory subsystems within emerging architectures has yet to be introduced. The CircusTent benchmark suite has been introduced to fill this void by measuring system performance with respect to atomic memory operations using established parallel programming models. However, a detailed description and evaluation of CircusTent in a distributed memory environment, critical to both current and future system architectures, has yet to be produced. In this work, we rectify this shortcoming by introducing CircusTent implementations based on the OpenSHMEM and MPI programming models and evaluating these implementations across a variety of platforms. We then detail our conclusions and characterize our observations regarding the effect of different system interconnects, memory hierarchies, and instruction set architectures on system performance.

Notes

1. In this work, we use the generic term “NIC” to refer to network adapters in both Ethernet and Cray Aries networks as well as InfiniBand HCAs.

Acknowledgments

The authors would like to thank Los Alamos National Laboratory for use of the Trinitite and Capulin systems during the evaluation of this work. This study is authorized for unlimited release under LA-UR-21-28928.

Author information

Authors and Affiliations

Texas Tech University, Lubbock, TX, USA
Brody Williams & Yong Chen
Tactical Computing Laboratories, Muenster, TX, USA
John D. Leidel & David Donofrio
RISC-V International Open Source Laboratory, Shenzhen, China
Xi Wang

Corresponding author

Correspondence to Brody Williams.

Editor information

Editors and Affiliations

Los Alamos National Laboratory, Los Alamos, NM, USA
Stephen Poole
NVIDIA Corporation, Santa Clara, CA, USA
Oscar Hernandez
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Matthew Baker
Stony Brook University, Stony Brook, NY, USA
Tony Curtis

Cite this paper

Williams, B., Leidel, J.D., Wang, X., Donofrio, D., Chen, Y. (2022). CircusTent: A Tool for Measuring the Performance of Atomic Memory Operations on Emerging Architectures. In: Poole, S., Hernandez, O., Baker, M., Curtis, T. (eds) OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Exascale and Smart Networks. OpenSHMEM 2021. Lecture Notes in Computer Science, vol 13159. Springer, Cham. https://doi.org/10.1007/978-3-031-04888-3_6

DOI: https://doi.org/10.1007/978-3-031-04888-3_6
Published: 20 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04887-6
Online ISBN: 978-3-031-04888-3
eBook Packages: Computer Science, Computer Science (R0)
