SU3_Bench on a Programmable Integrated Unified Memory Architecture (PIUMA) and How that Differs from Standard NUMA CPUs

Tithi, Jesmin Jahan; Checconi, Fabio; Doerfler, Douglas; Petrini, Fabrizio

doi:10.1007/978-3-031-07312-0_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13289)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

SU3_Bench explores performance portability across multiple programming models using a simple but nontrivial mathematical kernel. This kernel has been derived from the MILC lattice quantum chromodynamics (LQCD) code used in applications such as hadron physics and hence should be of interest to the scientific community.

SU3_Bench has a regular compute and data access pattern, and on most traditional CPU- and GPU-based systems its performance is mainly determined by the achievable memory bandwidth. However, this paper shows that on the new Intel Programmable Integrated Unified Memory Architecture (PIUMA), which is designed for sparse workloads and has a balanced flops-to-byte ratio with scalar cores, SU3_Bench’s performance is determined by the total number of instructions that can be executed per cycle (pipeline throughput) rather than by the usual bandwidth or flops. We present the performance analysis, porting, and optimization of SU3_Bench on the PIUMA architecture and discuss how they differ from those on standard NUMA CPUs (e.g., Xeon required NUMA optimizations, whereas on PIUMA they were not necessary). We show iso-bandwidth and iso-power comparisons of SU3_Bench for PIUMA vs. Xeon. We also show performance efficiency comparisons of SU3_Bench on PIUMA, Xeon, GPUs, and FPGAs based on pre-existing data. The lessons learned are generalizable to other similar kernels.
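To make the bandwidth argument concrete, the following is a minimal sketch of a dense 3x3 complex matrix product, the kind of operation an SU(3) kernel is built on, together with a rough flops-per-byte estimate. It is not the benchmark's code. The function names, the cost model (6 real flops per complex multiply, 2 per complex add), and the assumption that one operand stays in cache while the other matrix is streamed in and the result streamed out are all illustrative choices made here, not figures from the paper.

```python
# Illustrative only: a plain 3x3 complex matrix product and a rough
# flops-per-byte estimate. The cost model below is an assumption.


def su3_multiply(a, b):
    """Multiply two 3x3 complex matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def flops_per_product():
    """Real flops in one product: 27 complex multiplies, 18 complex adds."""
    complex_muls = 3 * 3 * 3  # one per (i, j, k) triple
    complex_adds = 3 * 3 * 2  # two adds to combine three terms per entry
    return complex_muls * 6 + complex_adds * 2  # assumed 6 flops/mul, 2 flops/add


def bytes_per_product(bytes_per_real=4):
    """Assumed traffic: one matrix read and one written; the other operand is cached."""
    reals_per_matrix = 3 * 3 * 2  # nine complex entries, two reals each
    return 2 * reals_per_matrix * bytes_per_real


def arithmetic_intensity(bytes_per_real=4):
    """Flops per byte of memory traffic under the assumptions above."""
    return flops_per_product() / bytes_per_product(bytes_per_real)


if __name__ == "__main__":
    identity = [[1 if i == j else 0 for j in range(3)] for i in range(3)]
    m = [[complex(i + 1, j) for j in range(3)] for i in range(3)]
    assert su3_multiply(identity, m) == m
    print("flops per byte (single precision):", arithmetic_intensity())
```

Under these assumptions the ratio is on the order of one flop per byte, which is low compared with the peak flops-to-byte ratio of typical CPUs and GPUs and is consistent with the kernel being bandwidth-bound there. On a machine with a more balanced ratio, a different limit, such as instruction throughput, can become the binding one.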

Notes

1. We are in the power-on phase of a PIUMA system, and we plan to update and integrate the simulated results with actual experimental data.

Author information

Authors and Affiliations

Parallel Computing Labs, Intel Corporation, Santa Clara, CA, 95054, USA
Jesmin Jahan Tithi, Fabio Checconi & Fabrizio Petrini
Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
Douglas Doerfler

Corresponding author

Correspondence to Jesmin Jahan Tithi.

Editor information

Editors and Affiliations

University of Twente, Enschede, The Netherlands
Ana-Lucia Varbanescu
University of Maryland, College Park, MD, USA
Abhinav Bhatele
University of Tennessee, Knoxville, TN, USA
Piotr Luszczek
Université Paris-Saclay, Orsay, France
Marc Baboulin

About this paper

Cite this paper

Tithi, J.J., Checconi, F., Doerfler, D., Petrini, F. (2022). SU3_Bench on a Programmable Integrated Unified Memory Architecture (PIUMA) and How that Differs from Standard NUMA CPUs. In: Varbanescu, AL., Bhatele, A., Luszczek, P., Marc, B. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_4

DOI: https://doi.org/10.1007/978-3-031-07312-0_4
Published: 29 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07311-3
Online ISBN: 978-3-031-07312-0
eBook Packages: Computer Science, Computer Science (R0)
