ABSTRACT
When executing a kernel function on a general-purpose graphics processing unit (GPGPU), it is critical to select an appropriate configuration setting for optimal performance. Configuration settings affect the allocation and utilization of GPGPU resources during the execution of a kernel function [1]. However, testing all possible configuration settings to find an optimal one is time-consuming and costly. To address this challenge, we propose a prediction mechanism that can suggest a configuration setting allowing the kernel function to complete its operation with minimal execution time. We start by filtering candidate configurations according to the amount of data, the mandatory parameters, and the optional parameters, and then calculate the occupancy of three critical resources on the GPGPU: warps, registers, and shared memory. We eliminate configuration settings whose average resource occupancy is lower than a user-defined threshold. The remaining configuration settings tend to yield better execution performance, so we use them to execute the kernel functions and record the resulting execution times. Finally, we use these configuration settings and their corresponding execution times as training data to build a prediction model with the logistic regression (LR) algorithm. At runtime, given the amount of data to be processed, the prediction model recommends a configuration setting with better performance. Our experiments confirm that the proposed mechanism improves kernel function execution performance more effectively than competing mechanisms. Note that the proposed mechanism can also be applied to other kernel functions.
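The occupancy-filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the device limits (warps, registers, and shared memory per streaming multiprocessor) are assumed example values for illustration, and the averaging of the three per-resource occupancies into a single score is one plausible reading of "average resource occupancy".

```python
# Hedged sketch of occupancy-based configuration filtering.
# The per-SM limits below are ASSUMED example values; on real hardware
# they would come from the device properties (e.g. cudaGetDeviceProperties).
MAX_WARPS_PER_SM = 64
MAX_REGS_PER_SM = 65536
MAX_SMEM_PER_SM = 65536  # bytes
WARP_SIZE = 32

def occupancy(block_size, regs_per_thread, smem_per_block):
    """Average occupancy (0..1) over the three limiting resources:
    warps, registers, and shared memory."""
    warps_per_block = -(-block_size // WARP_SIZE)  # ceiling division
    # How many blocks per SM each resource alone would allow.
    by_warps = MAX_WARPS_PER_SM // warps_per_block
    by_regs = MAX_REGS_PER_SM // (regs_per_thread * block_size)
    by_smem = MAX_SMEM_PER_SM // smem_per_block if smem_per_block else by_warps
    blocks = min(by_warps, by_regs, by_smem)
    # Fraction of each resource actually used by the resident blocks.
    warp_occ = blocks * warps_per_block / MAX_WARPS_PER_SM
    reg_occ = blocks * regs_per_thread * block_size / MAX_REGS_PER_SM
    smem_occ = blocks * smem_per_block / MAX_SMEM_PER_SM
    return (warp_occ + reg_occ + smem_occ) / 3

def filter_configs(configs, threshold):
    """Keep only configurations whose average occupancy meets the
    user-defined threshold; only these are timed and used as LR training data."""
    return [c for c in configs if occupancy(*c) >= threshold]
```

For example, with a threshold of 0.8, a 256-thread block using 32 registers per thread and 4 KB of shared memory passes the filter, while a 1024-thread block using 64 registers per thread and 48 KB of shared memory is eliminated, so only the former would be executed and fed to the logistic regression model.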
REFERENCES
- NVIDIA. 2021. CUDA Toolkit Documentation v11.3.0. https://docs.nvidia.com/cuda/index.html.
- Miroslav Kubat. 2017. An Introduction to Machine Learning. Springer, 43--62.
- Thanasekhar Balaiah and Ranjani Parthasarathi. 2020. Autotuning of configuration for program execution in GPUs. Concurrency and Computation: Practice and Experience 32, 9 (2020), e5635.
- Yalin Baştanlar and Mustafa Özuysal. 2014. Introduction to machine learning. miRNomics: MicroRNA Biology and Computational Analysis (2014), 105--128.
- Ben van Werkhoven. 2019. Kernel Tuner: A search-optimizing GPU code auto-tuner. Future Generation Computer Systems 90 (2019), 347--358.
Index Terms
- Improve the Performance of Parallel Reduction on General-Purpose Graphics Processor Units Using Prediction Models