Predicting GPU Kernel’s Performance on Upcoming Architectures

Van Lanker, Lucas; Taboada, Hugo; Brunet, Elisabeth; Trahay, François

doi:10.1007/978-3-031-69577-3_6

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14801)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

With the advent of heterogeneous systems that combine CPUs and GPUs, designing a supercomputer is becoming increasingly complex. The hardware characteristics of GPUs significantly impact application performance. Choosing the GPU that maximizes performance within a limited budget is tedious because it requires predicting performance on a hardware platform that does not yet exist.

In this paper, we propose a new methodology for predicting the performance of kernels running on GPUs. The method analyzes the behavior of an application running on an existing platform and projects its performance onto another GPU based on the hardware characteristics of the target. The projection relies on a hierarchical roofline model and on a comparison of the kernel’s assembly instructions for both GPUs, which is used to estimate the operational intensity on the target GPU.
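As a rough illustration of what a roofline-based projection can look like, the sketch below bounds a kernel by the lowest roof among the compute peak and each memory level, then scales a measured time by the ratio of the bounds on the two GPUs. This is not the authors' implementation: the `Gpu` class, the function names, the memory level names, and every hardware number are hypothetical, and the operational intensity on the target is simply passed in rather than derived from assembly instructions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Gpu:
    """Hypothetical hardware description; all numbers used below are made up."""
    peak_gflops: float                # peak compute throughput, GFLOP/s
    bandwidth_gbs: Dict[str, float]   # bandwidth per memory level, GB/s


def roofline_bound(gpu: Gpu, intensity: Dict[str, float]) -> float:
    """Attainable GFLOP/s: the lowest roof among compute and every memory level.

    intensity maps a memory level to the kernel's operational intensity
    (FLOP per byte moved at that level).
    """
    memory_roofs = [intensity[level] * gpu.bandwidth_gbs[level] for level in intensity]
    return min([gpu.peak_gflops] + memory_roofs)


def project_time(measured_s: float, source: Gpu, target: Gpu,
                 src_intensity: Dict[str, float],
                 tgt_intensity: Dict[str, float]) -> float:
    """Scale a measured kernel time by the ratio of the two roofline bounds."""
    return (measured_s * roofline_bound(source, src_intensity)
            / roofline_bound(target, tgt_intensity))


# Illustrative values only, not measurements of any real GPU.
source = Gpu(10000.0, {"L1": 10000.0, "L2": 3000.0, "HBM": 1000.0})
target = Gpu(20000.0, {"L1": 20000.0, "L2": 6000.0, "HBM": 2000.0})
intensity = {"L1": 0.5, "L2": 1.0, "HBM": 4.0}

projected = project_time(2.0, source, target, intensity, intensity)
```

With these made-up numbers the kernel is bound by the L2 roof on both GPUs, so doubling every bandwidth halves the projected time. Reusing the source intensity for the target is a simplification made here for brevity.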

We demonstrate the validity of our methodology using several mini-applications on modern NVIDIA GPUs. The experiments show that performance is predicted with a mean absolute percentage error of 20.3 % for LULESH, 10.2 % for MiniMDock, and 5.9 % for Quicksilver.
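For readers unfamiliar with the metric, the mean absolute percentage error averages the relative gap between each predicted value and its measured counterpart. The helper below is a minimal sketch of the standard definition. The function name and the sample values are hypothetical and are not taken from the paper's experiments.

```python
from typing import Sequence


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent, of predictions vs. measurements."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    # Relative error of each prediction with respect to its measured value.
    relative_errors = [abs(p - m) / abs(m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Made-up kernel times in seconds, for illustration only.
example = mape([2.0, 4.0], [1.0, 5.0])
```

Here the two relative errors are 50 % and 25 %, so `example` is 37.5.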

Acknowledgments

We thank the University of Oregon and the OACISS team for the use of their machines.

Author information

Authors and Affiliations

CEA, DAM, DIF, 91297, Arpajon, France
Lucas Van Lanker & Hugo Taboada
CEA, Laboratoire en Informatique Haute Performance pour le Calcul et la simulation, Université Paris-Saclay, 91680, Bruyères-le-Châtel, France
Hugo Taboada
Télécom SudParis, Institut Polytechnique de Paris, Inria, 91000, Évry, France
Lucas Van Lanker, Elisabeth Brunet & François Trahay

Corresponding author

Correspondence to Lucas Van Lanker.

Editor information

Editors and Affiliations

University Carlos III of Madrid, Madrid, Madrid, Spain
Jesus Carretero
University of Oregon, Eugene, OR, USA
Sameer Shende
University Carlos III of Madrid, Madrid, Spain
Javier Garcia-Blas
TU Wien, Vienna, Austria
Ivona Brandic
Universidad Complutense de Madrid, Madrid, Spain
Katzalin Olcoz
Université Grenoble Alpes, Saint Martin d'Hères, France
Martin Schreiber

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

About this paper

Cite this paper

Van Lanker, L., Taboada, H., Brunet, E., Trahay, F. (2024). Predicting GPU Kernel’s Performance on Upcoming Architectures. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14801. Springer, Cham. https://doi.org/10.1007/978-3-031-69577-3_6

DOI: https://doi.org/10.1007/978-3-031-69577-3_6
Published: 26 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-69576-6
Online ISBN: 978-3-031-69577-3
eBook Packages: Computer Science, Computer Science (R0)
