Abstract
With the advent of heterogeneous systems that combine CPUs and GPUs, designing a supercomputer is becoming increasingly complex. The hardware characteristics of GPUs significantly impact application performance. Choosing the GPU that maximizes performance within a limited budget is tedious because it requires predicting performance on a hardware platform that does not yet exist.
In this paper, we propose a new methodology for predicting the performance of kernels running on GPUs. The method analyzes the behavior of an application running on an existing platform and projects its performance onto another GPU based on the hardware characteristics of the target. The performance projection relies on a hierarchical roofline model and on a comparison of the kernel’s assembly instructions on both GPUs, which is used to estimate the operational intensity on the target GPU.
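To make the idea of a roofline-based projection concrete, the following Python sketch bounds a kernel's performance by the compute peak and by each memory level's bandwidth multiplied by the operational intensity at that level. It then assumes that the kernel reaches the same fraction of its bound on the target GPU as on the source GPU. This is a minimal illustration, not the methodology of the paper: the names `Gpu`, `roofline_bound`, and `project_performance`, the constant-efficiency assumption, and all numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Gpu:
    """Hypothetical hardware description used only for this sketch."""
    peak_gflops: float                    # compute peak, in GFLOP/s
    bandwidth_gbs: Mapping[str, float]    # memory level -> bandwidth, in GB/s


def roofline_bound(gpu: Gpu, intensity: Mapping[str, float]) -> float:
    """Hierarchical roofline bound: min(peak, min over levels of I_l * BW_l).

    `intensity` maps each memory level to the kernel's operational
    intensity at that level, in FLOP/byte.
    """
    memory_bounds = [
        intensity[level] * gpu.bandwidth_gbs[level] for level in intensity
    ]
    return min([gpu.peak_gflops] + memory_bounds)


def project_performance(
    measured_gflops: float,
    source: Gpu,
    target: Gpu,
    source_intensity: Mapping[str, float],
    target_intensity: Mapping[str, float],
) -> float:
    """Project measured performance from `source` onto `target`.

    Assumption of this sketch: the kernel reaches the same fraction of
    its roofline bound on both GPUs.
    """
    efficiency = measured_gflops / roofline_bound(source, source_intensity)
    return efficiency * roofline_bound(target, target_intensity)
```

For example, with made-up values, a source GPU with a 1000 GFLOP/s peak and bandwidths `{"L1": 4000, "L2": 2000, "HBM": 500}`, a target GPU with every figure doubled, and intensities `{"L1": 0.5, "L2": 1.0, "HBM": 1.5}` on both, a kernel measured at 600 GFLOP/s is bound by HBM bandwidth on the source (750 GFLOP/s) and is projected from the target's 1500 GFLOP/s bound. Passing a different `target_intensity` is where an estimate derived from the target's assembly instructions would enter.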
We demonstrate the validity of our methodology on modern NVIDIA GPUs using several mini-applications. The experiments show that performance is predicted with a mean absolute percentage error of 20.3 % for LULESH, 10.2 % for MiniMDock, and 5.9 % for Quicksilver.
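For reference, the mean absolute percentage error can be computed as in the sketch below. It uses the standard definition of the metric. The abstract does not specify which quantities are compared (for example per-kernel execution times), so the function name `mape` and the sample values in the usage note are assumptions made for illustration.

```python
from typing import Sequence


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Standard definition: 100 / n * sum(|measured_i - predicted_i| / |measured_i|).
    Measured values must be non-zero.
    """
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((m - p) / m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)
```

With hypothetical measurements `[100.0, 200.0]` and predictions `[110.0, 180.0]`, each prediction is off by 10 % of the measured value, so `mape` returns approximately 10.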
Acknowledgments
We thank the University of Oregon and the OACISS team for the use of their machines.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Van Lanker, L., Taboada, H., Brunet, E., Trahay, F. (2024). Predicting GPU Kernel’s Performance on Upcoming Architectures. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14801. Springer, Cham. https://doi.org/10.1007/978-3-031-69577-3_6
DOI: https://doi.org/10.1007/978-3-031-69577-3_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69576-6
Online ISBN: 978-3-031-69577-3
eBook Packages: Computer Science (R0)