Abstract
HPL (High Performance Linpack) is a widely accepted benchmark for evaluating high-performance computer clusters. It produces performance results by solving large dense linear systems and serves as the basis of the TOP500 supercomputer ranking. As the performance gap between CPUs and GPGPUs widens, the non-compute-intensive workload becomes increasingly time-critical and impedes sustained HPL performance more severely. Traditionally, HPL enforces a one-to-one mapping between processes and devices on multi-GPGPU platforms. While this simplifies the implementation, the even share of system resources among the processes in each node lowers system utilization in the major time-critical algorithmic steps of HPL. In this paper, we propose a novel device-centric HPL approach for current mainstream multi-GPGPU platforms, in which each process can make full use of the resources of a node, including accelerators, CPU sockets, PCIe buses, and memory/network bandwidth. As a result, the CPU-side workload and the inter-process communication are greatly accelerated owing to higher system utilization, while the computation on the device side remains efficient. Experiments show that on a single workstation with 4 GPGPUs, our approach achieves more than \(80\%\) of the theoretical peak and nearly \(95\%\) of the dgemm performance, significantly higher than the state-of-the-art counterpart on the same platform. On multi-GPGPU clusters, we also substantially improve the sustained performance and efficiency compared to previous HPL works incorporating multi-GPGPU features. Furthermore, based on both performance analysis and the experimental results, we believe that our approach can serve as a competitive cornerstone for further optimization on future heterogeneous platforms.
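The core of the device-centric mapping, a single host process driving all GPGPUs of a node rather than one process per device, can be illustrated with a minimal CUDA/cuBLAS sketch. This is not the paper's implementation: the matrix size, matrix contents, build command, and error handling are placeholder assumptions, and the dgemm calls merely stand in for HPL's trailing-matrix update.

```c
/* Minimal sketch of the device-centric idea, not the paper's implementation:
 * a single host process drives every GPGPU in the node. The dgemm here is a
 * placeholder for HPL's trailing-matrix update; size, matrix contents and
 * error handling are illustrative assumptions.
 * Build (assumed): nvcc -o sketch sketch.cu -lcublas
 */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);              /* all GPGPUs visible to this one process */
    if (ndev == 0) { fprintf(stderr, "no CUDA device found\n"); return 1; }

    const int n = 4096;                     /* illustrative update size */
    const size_t bytes = (size_t)n * n * sizeof(double);
    const double alpha = -1.0, beta = 1.0;  /* C = C - A*B, the shape of HPL's update */

    cublasHandle_t *h = (cublasHandle_t *)malloc(ndev * sizeof(*h));
    double **dA = (double **)malloc(ndev * sizeof(*dA));
    double **dB = (double **)malloc(ndev * sizeof(*dB));
    double **dC = (double **)malloc(ndev * sizeof(*dC));

    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);                   /* switch device context: no extra process needed */
        cublasCreate(&h[d]);
        cudaMalloc((void **)&dA[d], bytes);
        cudaMalloc((void **)&dB[d], bytes);
        cudaMalloc((void **)&dC[d], bytes);
        cudaMemset(dA[d], 0, bytes);        /* placeholder data */
        cudaMemset(dB[d], 0, bytes);
        cudaMemset(dC[d], 0, bytes);
        /* cublasDgemm returns as soon as the kernel is enqueued, so this loop
           fans the update out across all devices before any of them finishes */
        cublasDgemm(h[d], CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA[d], n, dB[d], n, &beta, dC[d], n);
    }

    for (int d = 0; d < ndev; ++d) {        /* join: wait for every device */
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cublasDestroy(h[d]);
        cudaFree(dA[d]); cudaFree(dB[d]); cudaFree(dC[d]);
    }
    free(h); free(dA); free(dB); free(dC);
    return 0;
}
```

Because each cublasDgemm call returns immediately after enqueuing work, one process keeps every accelerator busy at once while retaining all CPU sockets and PCIe lanes for host-side work, which is the utilization argument the abstract makes against the one-process-per-device mapping.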
Cite this article
Sun, Q., Ma, W., Sun, J. et al. Evolving the HPL benchmark towards multi-GPGPU clusters. CCF Trans. HPC 5, 84–96 (2023). https://doi.org/10.1007/s42514-022-00128-6