Abstract
Efforts to engineer the next generation of exascale platforms have produced a fundamental shift in system architectures. Contrary to what was once considered conventional wisdom, today's high performance systems are characterized by heterogeneous architectures in which distinct components are carefully combined to optimize system performance and energy efficiency. One unintended consequence of this new paradigm is an increasingly complex memory hierarchy that frequently spans multiple devices and may be composed of disparate memory types. Unfortunately, the effect of this new memory model on performance is not well understood. Moreover, a quantitative, system-agnostic methodology capable of assessing the performance of the diverse memory subsystems within emerging architectures has yet to be established. The CircusTent benchmark suite was introduced to fill this void by measuring system performance with respect to atomic memory operations using established parallel programming models. However, a detailed description and evaluation of CircusTent in a distributed memory environment, critical to both current and future system architectures, has yet to be produced. In this work, we rectify this shortcoming by introducing CircusTent implementations based on the OpenSHMEM and MPI programming models and evaluating these implementations across a variety of platforms. We then detail our conclusions and characterize our observations regarding the effect of different system interconnects, memory hierarchies, and instruction set architectures on system performance.
Notes
1. In this work, we use the generic term “NIC” to refer to network adapters in both Ethernet and Cray Aries networks as well as InfiniBand HCAs.
Acknowledgments
The authors would like to thank Los Alamos National Laboratory for use of the Trinitite and Capulin systems during the evaluation of this work. This study is authorized for unlimited release under LA-UR-21-28928.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Williams, B., Leidel, J.D., Wang, X., Donofrio, D., Chen, Y. (2022). CircusTent: A Tool for Measuring the Performance of Atomic Memory Operations on Emerging Architectures. In: Poole, S., Hernandez, O., Baker, M., Curtis, T. (eds) OpenSHMEM and Related Technologies. OpenSHMEM in the Era of Exascale and Smart Networks. OpenSHMEM 2021. Lecture Notes in Computer Science, vol 13159. Springer, Cham. https://doi.org/10.1007/978-3-031-04888-3_6
DOI: https://doi.org/10.1007/978-3-031-04888-3_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04887-6
Online ISBN: 978-3-031-04888-3
eBook Packages: Computer Science, Computer Science (R0)