Abstract
Understanding the performance bottlenecks of applications in high-performance computing can lead to dramatic improvements in their performance. For example, a key problem in GPU programming is finding performance bottlenecks and resolving them to reach the best possible performance. On GPU architectures, these bottlenecks arise from a number of factors, such as memory access latency, branch divergence, utilization, and the amount of available parallelism. Moreover, simple profiling cannot reveal the relations between these bottlenecks. In this paper, we propose a statistical performance analyzer framework that not only helps find bottlenecks but also indicates the relations between them, which is not possible using a profiler. OpenCL has recently been proposed for use on a variety of platforms, e.g., CPUs and GPUs, enabling a program written for one platform to be ported to other platforms with minimal effort. Therefore, we selected OpenCL to design our performance model for Nvidia GPUs. To construct the model, the values of GPU performance counters are measured for the selected benchmarks. Then, well-known statistical techniques such as regression and principal component analysis are applied to these measurements to find the most significant parameters and to construct a performance model with up to 99% accuracy. Finally, this method can be leveraged to characterize unknown applications based on their performance similarities with an existing database of benchmarks and thus to predict their likely performance bottlenecks.
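As a rough illustration of the statistical pipeline the abstract describes, the following sketch standardizes a matrix of counter values, reduces it with principal component analysis, and fits a linear regression on the component scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the synthetic data, and the use of Euclidean distance in component space to find the most similar benchmark are all hypothetical choices made for this example.

```python
import numpy as np


def fit_pca_regression(X, y, n_components):
    """Standardize counters, project onto principal components, fit least squares.

    X: (benchmarks x counters) matrix of counter values (hypothetical layout).
    y: measured performance metric per benchmark, e.g. execution time.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant counters
    Z = (X - mean) / std

    # PCA via SVD of the standardized data; rows of vt are the components.
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    components = vt[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)

    # Linear regression (with intercept) on the component scores.
    scores = Z @ components.T
    A = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    return {
        "mean": mean,
        "std": std,
        "components": components,
        "explained": explained[:n_components],
        "coef": coef,
    }


def project(model, X):
    """Map raw counter values to principal component scores."""
    Z = (np.atleast_2d(X) - model["mean"]) / model["std"]
    return Z @ model["components"].T


def predict(model, X):
    """Predict the performance metric from raw counter values."""
    return model["coef"][0] + project(model, X) @ model["coef"][1:]


def r_squared(y, y_hat):
    """Coefficient of determination of the fitted model."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def nearest_benchmark(model, X_db, x_new):
    """Index of the database benchmark closest to x_new in component space.

    Euclidean distance is an assumption of this sketch, not a method
    confirmed by the source.
    """
    distances = np.linalg.norm(project(model, X_db) - project(model, x_new), axis=1)
    return int(np.argmin(distances))


if __name__ == "__main__":
    # Synthetic stand-in for measured counters: 40 benchmarks, 5 counters.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 5))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.normal(size=40)

    model = fit_pca_regression(X, y, n_components=5)
    print("explained variance ratios:", model["explained"])
    print("R^2 on training data:", r_squared(y, predict(model, X)))
    print("closest benchmark to X[3]:", nearest_benchmark(model, X, X[3]))
```

In practice the rows of `X` would hold counter values collected for each benchmark rather than random numbers, and accuracy would be assessed on held-out benchmarks (for example with cross-validation) rather than on the training data.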
Acknowledgments
We would like to thank Dr. Reza Sameni for his thoughtful suggestions and comments, which helped us improve the manuscript. This research was supported in part by the School of Computer Science, Institute for Research in Fundamental Sciences (IPM), under Grant Number CS1392-4-28.
Cite this article
Karami, A., Khunjush, F. & Mirsoleimani, S.A. A statistical performance analyzer framework for OpenCL kernels on Nvidia GPUs. J Supercomput 71, 2900–2921 (2015). https://doi.org/10.1007/s11227-014-1338-z
DOI: https://doi.org/10.1007/s11227-014-1338-z