Abstract
Optimizing GPU kernels is challenging because it requires deep technical knowledge of the underlying hardware. Modern GPU architectures are also becoming increasingly diverse, which further exacerbates the already difficult problem of performance optimization. This paper presents an insightful performance tuning chain for GPUs. The goal is to help non-expert programmers with limited knowledge of GPU architecture implement high-performance GPU kernels directly. We achieve this by providing performance information that identifies a GPU program's performance bottlenecks and indicates which optimization methods should be adopted, so as to facilitate the best match between algorithm features and the characteristics of the underlying hardware. To demonstrate the use of the tuning chain, we optimize three representative GPU kernels with different compute intensities: matrix transpose, Laplace transform, and integral, on both NVIDIA and AMD GPUs. Experimental results show that, under the guidance of our tuning chain, these kernels achieve speedups of 7.8x to 42.4x over their naïve implementations on both NVIDIA and AMD GPU platforms.
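To give a flavor of the bottleneck identification step the abstract describes, the sketch below shows a minimal roofline-style check: a kernel's arithmetic intensity (flops per byte moved) is compared against the ridge point of a GPU's peak compute rate over peak memory bandwidth. This is an illustrative simplification, not the paper's actual model, and the peak throughput numbers are hypothetical placeholders rather than measured values for any specific NVIDIA or AMD device.

```python
# Roofline-style bottleneck check: a minimal sketch, not the paper's model.
# Peak GFLOP/s and GB/s values below are hypothetical placeholders.

def bottleneck(flops, bytes_moved, peak_gflops, peak_gbps):
    """Classify a kernel as memory-bound or compute-bound on a given GPU."""
    intensity = flops / bytes_moved      # kernel's flop/byte ratio
    ridge = peak_gflops / peak_gbps      # flop/byte where the two roofs meet
    return "compute-bound" if intensity > ridge else "memory-bound"

n = 4096

# Matrix transpose: ~0 flops, one read + one write of a 4-byte float per element.
transpose = bottleneck(flops=0, bytes_moved=2 * 4 * n * n,
                       peak_gflops=1000.0, peak_gbps=150.0)

# A compute-heavy integral kernel: many flops per element loaded.
integral = bottleneck(flops=200 * n, bytes_moved=4 * n,
                      peak_gflops=1000.0, peak_gbps=150.0)

print(transpose)  # memory-bound -> focus on access patterns (e.g. coalescing)
print(integral)   # compute-bound -> focus on ALU utilization
```

A memory-bound verdict steers the programmer toward memory-hierarchy optimizations, while a compute-bound verdict steers toward instruction-level ones; the paper's tuning chain supplies the performance information that drives this kind of decision.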
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Jia, H., Zhang, Y., Long, G., Yan, S. (2012). An Insightful Program Performance Tuning Chain for GPU Computing. In: Xiang, Y., Stojmenovic, I., Apduhan, B.O., Wang, G., Nakano, K., Zomaya, A. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2012. Lecture Notes in Computer Science, vol 7439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33078-0_36
DOI: https://doi.org/10.1007/978-3-642-33078-0_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33077-3
Online ISBN: 978-3-642-33078-0