A statistical performance analyzer framework for OpenCL kernels on Nvidia GPUs

The Journal of Supercomputing

Abstract

Understanding the performance bottlenecks of high-performance computing applications can lead to dramatic improvements in their performance. A key problem in GPU programming, for example, is finding these bottlenecks and resolving them to reach the best possible performance. On GPU architectures, bottlenecks arise from factors such as memory access latency, branch divergence, utilization, and the amount of available parallelism. Simple profiling, moreover, cannot reveal the relations between these bottlenecks. In this paper, we propose a statistical performance analyzer framework that not only helps find bottlenecks but also indicates the relations between them, which is not possible using a profiler. OpenCL has recently been proposed for use on a variety of platforms, e.g., CPUs and GPUs, enabling a program written for one platform to be ported to others with minimal effort. We therefore selected OpenCL to design our performance model for Nvidia GPUs. To construct the model, the values of GPU performance counters are measured for the selected benchmarks. Well-known statistical techniques such as regression and principal component analysis then use these measurements to find the most significant parameters and to construct a performance model with up to 99% accuracy. Finally, the method can be leveraged to characterize unknown applications by their performance similarity to an existing database of benchmarks, and so to predict their likely performance bottlenecks.
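The abstract does not spell out the statistical pipeline, so the following is only a minimal sketch of one common way to combine the two techniques it names: principal component regression on a matrix of counter values, followed by a nearest-neighbour lookup in principal-component space to find the benchmark most similar to an unknown kernel. All function names, the kernel names, and the synthetic data are hypothetical and are not taken from the paper; real inputs would be measured GPU performance counters and kernel run times.

```python
import numpy as np


def fit_pca_regression(counters, runtimes, n_components):
    """Fit a principal component regression (assumed pipeline, not the paper's).

    counters: (n_kernels, n_counters) array of performance counter values.
    runtimes: (n_kernels,) array of measured run times.
    """
    mean = counters.mean(axis=0)
    std = counters.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant counters
    z = (counters - mean) / std

    # PCA via SVD of the standardized data; rows of vt are principal directions.
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    components = vt[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)

    # Ordinary least squares of run time on the component scores (plus intercept).
    scores = z @ components.T
    design = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(design, runtimes, rcond=None)
    return {"mean": mean, "std": std, "components": components,
            "explained": explained[:n_components], "coef": coef}


def project(model, counters):
    """Map raw counter values into principal-component space."""
    return ((counters - model["mean"]) / model["std"]) @ model["components"].T


def predict(model, counters):
    """Predict run times from raw counter values."""
    return model["coef"][0] + project(model, counters) @ model["coef"][1:]


def r_squared(y, y_hat):
    """Coefficient of determination of the fit."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def nearest_benchmark(model, database, names, unknown):
    """Name of the database kernel closest to `unknown` in component space."""
    dist = np.linalg.norm(project(model, database) - project(model, unknown), axis=1)
    return names[int(np.argmin(dist))]


if __name__ == "__main__":
    # Synthetic stand-in for measured counters (hypothetical data).
    rng = np.random.default_rng(0)
    counters = rng.random((40, 6))
    runtimes = counters @ rng.random(6) + 0.01 * rng.standard_normal(40)
    names = [f"kernel_{i}" for i in range(40)]

    model = fit_pca_regression(counters, runtimes, n_components=3)
    print("variance explained per component:", model["explained"])
    print("R^2 on training data:", r_squared(runtimes, predict(model, counters)))
    print("most similar benchmark:", nearest_benchmark(model, counters, names, counters[5]))
```

Standardizing the counters before the SVD keeps counters with large raw magnitudes from dominating the components. The number of retained components is a free choice in this sketch, and a training-set R² alone does not indicate predictive accuracy; held-out validation would be needed for that.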

Acknowledgments

We would like to thank Dr. Reza Sameni for his thoughtful suggestions and comments, which helped us improve the manuscript. This research was supported in part by the School of Computer Science, Institute for Research in Fundamental Sciences (IPM), under Grant Number CS1392-4-28.

Author information

Corresponding author

Correspondence to Farshad Khunjush.

About this article

Cite this article

Karami, A., Khunjush, F. & Mirsoleimani, S.A. A statistical performance analyzer framework for OpenCL kernels on Nvidia GPUs. J Supercomput 71, 2900–2921 (2015). https://doi.org/10.1007/s11227-014-1338-z

  • DOI: https://doi.org/10.1007/s11227-014-1338-z
