A statistical performance analyzer framework for OpenCL kernels on Nvidia GPUs

Karami, Ali; Khunjush, Farshad; Mirsoleimani, Seyyed Ali

doi:10.1007/s11227-014-1338-z

Published: 13 December 2014

Volume 71, pages 2900–2921, (2015)

The Journal of Supercomputing

Ali Karami¹, Farshad Khunjush¹,² & Seyyed Ali Mirsoleimani¹

Abstract

Understanding the performance bottlenecks of applications in high-performance computing can lead to dramatic improvements in their performance. A key problem in GPU programming, for example, is finding performance bottlenecks and resolving them to reach the best possible performance. In GPU architectures, these bottlenecks stem from factors such as memory access latency, branch divergence, utilization, and the amount of available parallelism. Simple profiling, moreover, cannot reveal the relations between these bottlenecks. In this paper, we propose a statistical performance analyzer framework that not only helps find bottlenecks but also indicates the relations between them, which is not possible with a profiler alone. OpenCL has recently been proposed for use on a variety of platforms, e.g., CPUs and GPUs, enabling a program written for one platform to be ported to other platforms with minimal effort. We therefore selected OpenCL to design our performance model for Nvidia GPUs. To construct the model, the values of GPU performance counters are measured for the selected benchmarks. Well-known statistical techniques such as regression and principal component analysis then use these measurements to find the most significant parameters and to construct a performance model with up to 99% accuracy. Finally, this method can be leveraged to characterize unknown applications based on their performance similarities with an existing database of benchmarks, and thus to predict their likely performance bottlenecks.
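The abstract names regression and principal component analysis applied to performance-counter measurements but does not give the procedure. The following is only a minimal sketch of that general idea, not the authors' framework. It assumes a made-up table of three illustrative counters (the names `mem_transactions`, `divergent_branches`, and `active_warps` and all values are hypothetical, not real CUPTI counters or measured data), extracts the first principal component by power iteration, and regresses kernel runtime on the resulting scores.

```python
import math

# Hypothetical data: one row per kernel, columns are illustrative counters
# (mem_transactions, divergent_branches, active_warps). Not measured values.
COUNTERS = [
    [10, 1, 8], [20, 2, 16], [30, 4, 22],
    [40, 5, 33], [50, 7, 40], [60, 8, 47],
]
RUNTIME_MS = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]  # hypothetical kernel runtimes


def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(m)]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]


def covariance(z):
    """Sample covariance of already centered columns."""
    n, m = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(m)]
            for a in range(m)]


def leading_component(cov, iters=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    m = len(cov)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x, plus R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


z = standardize(COUNTERS)
pc1 = leading_component(covariance(z))           # loadings of each counter
scores = [sum(v * x for v, x in zip(pc1, row)) for row in z]
intercept, slope, r2 = fit_line(scores, RUNTIME_MS)

print("PC1 loadings:", [round(v, 3) for v in pc1])
print("runtime ~ %.3f + %.3f * PC1, R^2 = %.3f" % (intercept, slope, r2))
```

In this sketch the loadings in `pc1` show how strongly each counter contributes to the dominant direction of variation, and `r2` shows how much of the runtime variation that single direction explains. A fuller model would retain several components and use multiple regression.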

Acknowledgments

We would like to thank Dr. Reza Sameni for his thoughtful suggestions and comments, which helped us improve the manuscript. This research was supported in part by the School of Computer Science, Institute for Research in Fundamental Sciences (IPM), under Grant Number CS1392-4-28.

Author information

Authors and Affiliations

Department of Computer Science, Engineering, and IT, School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
Ali Karami, Farshad Khunjush & Seyyed Ali Mirsoleimani
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), P.O. Box 19395-5746, Tehran, Iran
Farshad Khunjush

Corresponding author

Correspondence to Farshad Khunjush.

About this article

Cite this article

Karami, A., Khunjush, F. & Mirsoleimani, S.A. A statistical performance analyzer framework for OpenCL kernels on Nvidia GPUs. J Supercomput 71, 2900–2921 (2015). https://doi.org/10.1007/s11227-014-1338-z

Issue Date: August 2015
DOI: https://doi.org/10.1007/s11227-014-1338-z
