Abstract
SU3_Bench explores performance portability across multiple programming models using a simple but nontrivial mathematical kernel. This kernel is derived from lattice quantum chromodynamics (LQCD) code used in applications such as hadron physics and should therefore be of interest to the scientific community.
SU3_Bench has a regular compute and data access pattern, and on most traditional CPU- and GPU-based systems its performance is determined mainly by the achievable memory bandwidth. However, this paper shows that on the new Intel Programmable Integrated Unified Memory Architecture (PIUMA), which is designed for sparse workloads and has a balanced flops-to-byte ratio with scalar cores, SU3_Bench's performance is determined by the total number of instructions that can be executed per cycle (pipeline throughput) rather than by the usual bandwidth or flops. We present the performance analysis, porting, and optimization of SU3_Bench on the PIUMA architecture and discuss how they differ from those on standard NUMA CPUs (e.g., Xeon required NUMA optimizations, whereas on PIUMA they were not necessary). We show iso-bandwidth and iso-power comparisons of SU3_Bench for PIUMA vs. Xeon. We also show performance efficiency comparisons of SU3_Bench on PIUMA, Xeon, GPUs, and FPGAs based on pre-existing data. The lessons learned are generalizable to other similar kernels.
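To make the bandwidth-bound claim concrete, the following is a minimal sketch and not the SU3_Bench source. It assumes the kernel's core operation is a 3x3 complex matrix product, that each complex multiply-accumulate counts as 8 flops, that values are single-precision complex (8 bytes), and that both inputs and the output stream from memory. The function names and the roofline-style bound are illustrative assumptions; the bound shown does not model the instruction-throughput limit that the abstract reports for PIUMA.

```python
def su3_mat_mul(a, b):
    """Multiply two 3x3 complex matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def arithmetic_intensity(bytes_per_complex=8):
    """Flops per byte for one 3x3 complex product (assumed counting convention)."""
    # 9 output elements, 3 complex multiply-accumulates each, 8 flops per MAC.
    flops = 9 * 3 * 8
    # Assumption: read A, read B, write C; each holds 9 complex values.
    bytes_moved = 3 * 9 * bytes_per_complex
    return flops / bytes_moved


def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline-style upper bound: the lower of compute peak and bandwidth limit."""
    return min(peak_gflops, bandwidth_gbs * intensity)


if __name__ == "__main__":
    ai = arithmetic_intensity()
    # Hypothetical machine numbers, chosen only to illustrate the bound.
    print(ai, attainable_gflops(peak_gflops=1000.0, bandwidth_gbs=100.0, intensity=ai))
```

Under these assumptions the intensity is about one flop per byte, so on a machine whose peak flop rate is much higher than its memory bandwidth the bandwidth term sets the bound, which matches the behavior the abstract describes for traditional CPUs and GPUs.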
Notes
1. We are in the power-on phase of a PIUMA system, and we plan to update and integrate the simulated results with actual experimental data.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Tithi, J.J., Checconi, F., Doerfler, D., Petrini, F. (2022). SU3_Bench on a Programmable Integrated Unified Memory Architecture (PIUMA) and How that Differs from Standard NUMA CPUs. In: Varbanescu, AL., Bhatele, A., Luszczek, P., Marc, B. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_4
DOI: https://doi.org/10.1007/978-3-031-07312-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07311-3
Online ISBN: 978-3-031-07312-0
eBook Packages: Computer Science, Computer Science (R0)