Abstract
Optimizing GPU kernels is challenging because it requires deep technical knowledge of the underlying hardware. Modern GPU architectures are also becoming increasingly diverse, which further exacerbates the already difficult problem of performance optimization. This paper presents an insightful performance tuning chain for GPUs. The goal is to help non-expert programmers with limited knowledge of GPU architecture implement high-performance GPU kernels directly. We achieve this by providing performance information that identifies a GPU program's performance bottlenecks and indicates which optimization methods should be adopted, so as to facilitate the best match between algorithm features and the characteristics of the underlying hardware. To demonstrate the use of the tuning chain, we optimize three representative GPU kernels with different compute intensities: matrix transpose, Laplace transform, and integral, on both NVIDIA and AMD GPUs. Experimental results show that, under the guidance of our tuning chain, these kernels achieve speedups of 7.8x to 42.4x over their naïve implementations on both NVIDIA and AMD GPU platforms.
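To give a flavor of the bottleneck identification step the abstract describes, the sketch below shows a minimal roofline-style check: a kernel's arithmetic intensity (flops per byte moved) is compared against the ridge point of a GPU's peak compute rate over peak memory bandwidth. This is an illustrative simplification, not the paper's actual model, and the peak throughput numbers are hypothetical placeholders rather than measured values for any specific NVIDIA or AMD device.

```python
# Roofline-style bottleneck check: a minimal sketch, not the paper's model.
# Peak GFLOP/s and GB/s values below are hypothetical placeholders.

def bottleneck(flops, bytes_moved, peak_gflops, peak_gbps):
    """Classify a kernel as memory-bound or compute-bound on a given GPU."""
    intensity = flops / bytes_moved      # kernel's flop/byte ratio
    ridge = peak_gflops / peak_gbps      # flop/byte where the two roofs meet
    return "compute-bound" if intensity > ridge else "memory-bound"

n = 4096

# Matrix transpose: ~0 flops, one read + one write of a 4-byte float per element.
transpose = bottleneck(flops=0, bytes_moved=2 * 4 * n * n,
                       peak_gflops=1000.0, peak_gbps=150.0)

# A compute-heavy integral kernel: many flops per element loaded.
integral = bottleneck(flops=200 * n, bytes_moved=4 * n,
                      peak_gflops=1000.0, peak_gbps=150.0)

print(transpose)  # memory-bound -> focus on access patterns (e.g. coalescing)
print(integral)   # compute-bound -> focus on ALU utilization
```

A memory-bound verdict steers the programmer toward memory-hierarchy optimizations, while a compute-bound verdict steers toward instruction-level ones; the paper's tuning chain supplies the performance information that drives this kind of decision.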
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Jia, H., Zhang, Y., Long, G., Yan, S. (2012). An Insightful Program Performance Tuning Chain for GPU Computing. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2012. Lecture Notes in Computer Science, vol 7439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33078-0_36
DOI: https://doi.org/10.1007/978-3-642-33078-0_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33077-3
Online ISBN: 978-3-642-33078-0