Abstract
HPL (High Performance Linpack) is a widely accepted benchmark for evaluating high-performance computer clusters. It produces performance results by solving large dense linear systems and serves as the basis of the TOP500 supercomputer ranking. As the performance gap between CPUs and GPGPUs widens, the non-compute-intensive workload becomes increasingly time-critical and impedes sustained HPL performance more severely. Traditionally, HPL enforces a one-to-one mapping between processes and devices on multi-GPGPU platforms. While this simplifies the implementation, the even share of system resources among the processes in each node lowers system utilization in the major time-critical algorithmic steps of HPL. In this paper, we propose a novel device-centric HPL approach for current mainstream multi-GPGPU platforms, in which each process can make full use of the resources of a node, including accelerators, CPU sockets, PCIe buses, and memory/network bandwidth. As a result, the CPU-side workload and the inter-process communication are greatly accelerated owing to higher system utilization, while the computation on the device side remains efficient. Experiments show that on a single workstation with 4 GPGPUs, our approach achieves more than \(80\%\) of the theoretical peak and nearly \(95\%\) of the dgemm performance, significantly higher than the state-of-the-art counterpart on the same platform. On multi-GPGPU clusters, we also substantially improve the sustained performance and efficiency compared to previous HPL works incorporating multi-GPGPU features. Furthermore, based on both performance analysis and the experimental results, we believe that our approach can serve as a competitive cornerstone for further optimization on future heterogeneous platforms.
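The core of the device-centric mapping, a single host process driving all GPGPUs of a node rather than one process per device, can be illustrated with a minimal CUDA/cuBLAS sketch. This is not the paper's implementation: the matrix size, matrix contents, build command, and error handling are placeholder assumptions, and the dgemm calls merely stand in for HPL's trailing-matrix update.

```c
/* Minimal sketch of the device-centric idea, not the paper's implementation:
 * a single host process drives every GPGPU in the node. The dgemm here is a
 * placeholder for HPL's trailing-matrix update; size, matrix contents and
 * error handling are illustrative assumptions.
 * Build (assumed): nvcc -o sketch sketch.cu -lcublas
 */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);              /* all GPGPUs visible to this one process */
    if (ndev == 0) { fprintf(stderr, "no CUDA device found\n"); return 1; }

    const int n = 4096;                     /* illustrative update size */
    const size_t bytes = (size_t)n * n * sizeof(double);
    const double alpha = -1.0, beta = 1.0;  /* C = C - A*B, the shape of HPL's update */

    cublasHandle_t *h = (cublasHandle_t *)malloc(ndev * sizeof(*h));
    double **dA = (double **)malloc(ndev * sizeof(*dA));
    double **dB = (double **)malloc(ndev * sizeof(*dB));
    double **dC = (double **)malloc(ndev * sizeof(*dC));

    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);                   /* switch device context: no extra process needed */
        cublasCreate(&h[d]);
        cudaMalloc((void **)&dA[d], bytes);
        cudaMalloc((void **)&dB[d], bytes);
        cudaMalloc((void **)&dC[d], bytes);
        cudaMemset(dA[d], 0, bytes);        /* placeholder data */
        cudaMemset(dB[d], 0, bytes);
        cudaMemset(dC[d], 0, bytes);
        /* cublasDgemm returns as soon as the kernel is enqueued, so this loop
           fans the update out across all devices before any of them finishes */
        cublasDgemm(h[d], CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA[d], n, dB[d], n, &beta, dC[d], n);
    }

    for (int d = 0; d < ndev; ++d) {        /* join: wait for every device */
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cublasDestroy(h[d]);
        cudaFree(dA[d]); cudaFree(dB[d]); cudaFree(dC[d]);
    }
    free(h); free(dA); free(dB); free(dC);
    return 0;
}
```

Because each cublasDgemm call returns immediately after enqueuing work, one process keeps every accelerator busy at once while retaining all CPU sockets and PCIe lanes for host-side work, which is the utilization argument the abstract makes against the one-process-per-device mapping.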
Cite this article
Sun, Q., Ma, W., Sun, J. et al. Evolving the HPL benchmark towards multi-GPGPU clusters. CCF Trans. HPC 5, 84–96 (2023). https://doi.org/10.1007/s42514-022-00128-6