Automatic Tuning of CUDA Execution Parameters for Stencil Processing

Sato, Katsuto; Takizawa, Hiroyuki; Komatsu, Kazuhiko; Kobayashi, Hiroaki

doi:10.1007/978-1-4419-6935-4_13

Abstract

Compute Unified Device Architecture (CUDA) has enabled Graphics Processing Units (GPUs) to accelerate a wide range of applications. However, to fully exploit a GPU’s computing power, a programmer has to carefully adjust several CUDA execution parameters, even for simple stencil processing kernels. This paper therefore develops an automatic parameter tuning mechanism that uses profiling to predict the optimal execution parameters. It first discusses the scope of the parameter exploration space, which is determined by the GPU’s architectural restrictions. To find the optimal execution parameters, performance models are built by profiling the execution time of a kernel under each promising parameter configuration, and the execution parameters are then determined from those models. The performance improvement due to the proposed mechanism is evaluated using two benchmark programs. The evaluation results show that the proposed mechanism can appropriately select a suboptimal Cooperative Thread Array (CTA) configuration whose performance is comparable to that of the optimal one.
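The general idea of restricting the exploration space and then profiling the remaining candidates can be illustrated with a minimal sketch. It is not the chapter's mechanism: it simply takes the minimum over measured times instead of building performance models. The limits (32-thread warps, at most 512 threads per CTA) are assumptions typical of early CUDA hardware, and `candidate_cta_shapes`, `pick_fastest`, and `toy_measure` are hypothetical names introduced here for illustration.

```python
from itertools import product

# Assumed limits for an early CUDA GPU; these values are not taken from the chapter.
WARP_SIZE = 32
MAX_THREADS_PER_CTA = 512


def candidate_cta_shapes(max_threads=MAX_THREADS_PER_CTA, warp=WARP_SIZE):
    """Enumerate 2-D CTA shapes (x, y) with power-of-two sides whose thread
    count is a multiple of the warp size and does not exceed the limit."""
    sides = [1 << k for k in range(10)]  # 1, 2, 4, ..., 512
    return [
        (x, y)
        for x, y in product(sides, sides)
        if x * y <= max_threads and (x * y) % warp == 0
    ]


def pick_fastest(shapes, measure):
    """Profile every shape with the caller-supplied `measure` function and
    return the shape with the smallest measured time plus all timings."""
    timings = {shape: measure(shape) for shape in shapes}
    return min(timings, key=timings.get), timings


def toy_measure(shape):
    """Hypothetical stand-in for timing a real kernel launch."""
    x, y = shape
    # Toy cost: penalize narrow rows and small CTAs. Purely illustrative.
    return 1.0 / min(x, 16) + 64.0 / (x * y)


if __name__ == "__main__":
    shapes = candidate_cta_shapes()
    best, timings = pick_fastest(shapes, toy_measure)
    print(f"{len(shapes)} candidates, fastest under the toy cost: {best}")
```

In a real setting, `measure` would launch the stencil kernel with the given CTA shape and return its execution time; the toy cost function only exists so that the sketch runs without a GPU.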

Notes

1. The current SPRAT compiler does not generate code that dynamically allocates shared memory; therefore, the dynamically allocated shared memory size is not considered here.
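One plausible reason the shared memory size matters is that it limits how many CTAs can be resident on a multiprocessor at once. The sketch below shows that bound when only statically allocated shared memory is counted. The default limits (8 CTAs, 768 threads, and 16 KB of shared memory per multiprocessor) are assumptions matching early CUDA hardware, not values from the chapter. The function name `resident_ctas_per_sm` is hypothetical, and register usage and allocation granularity are deliberately ignored.

```python
def resident_ctas_per_sm(threads_per_cta, static_smem_bytes,
                         max_ctas=8, max_threads=768, smem_bytes=16 * 1024):
    """Upper bound on CTAs resident on one multiprocessor, limited by the
    CTA slot count, the thread count, and static shared memory per CTA.
    All default limits are assumptions, not values from the chapter."""
    if threads_per_cta <= 0:
        raise ValueError("threads_per_cta must be positive")
    limits = [max_ctas, max_threads // threads_per_cta]
    if static_smem_bytes > 0:
        # Dynamically allocated shared memory is not modeled, as in the note.
        limits.append(smem_bytes // static_smem_bytes)
    return min(limits)


if __name__ == "__main__":
    # 256 threads and 4 KB of static shared memory per CTA: thread-limited.
    print(resident_ctas_per_sm(256, 4096))
    # 64 threads and 6 KB of static shared memory per CTA: memory-limited.
    print(resident_ctas_per_sm(64, 6144))
```

A configuration for which this bound is zero cannot run at all, so a check like this could be used to prune the exploration space before any profiling takes place.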

Acknowledgement

The authors would like to acknowledge support from the Tohoku University Global COE Program on World Center of Education and Research for Trans-disciplinary Flow Dynamics. This work was partially supported by Grants-in-Aid for Young Scientists (B) #21700049 and Scientific Research (B) #21300007, by the NAKAYAMA HAYAO Foundation for Science & Technology and Culture, and by JST, CREST.

Author information

Authors and Affiliations

Graduate School of Information Sciences, Tohoku University, 4F, 6-3, Aramaki-aza-aoba, Aoba-ku, Sendai, 980-8578, Japan
Hiroyuki Takizawa

Corresponding author

Correspondence to Hiroyuki Takizawa.

Editor information

Editors and Affiliations

Central Research Laboratory, Hitachi Ltd., Higashi-Koigakubo 1-280, Kokubunji-shi, Tokyo, 185-8601, Japan
Ken Naono
Cray, Inc., Jackson St. 380, St Paul, 55101, Minnesota, USA
Keita Teranishi
Dept. Computer & Information Sciences, University of Delaware, Smith Hall 101, Newark, 19716, Delaware, USA
John Cavazos
Dept. Computer Science, University of Tokyo, Hongo 7-3-1, Tokyo, 113-0033, Japan
Reiji Suda

About this chapter

Cite this chapter

Sato, K., Takizawa, H., Komatsu, K., Kobayashi, H. (2011). Automatic Tuning of CUDA Execution Parameters for Stencil Processing. In: Naono, K., Teranishi, K., Cavazos, J., Suda, R. (eds) Software Automatic Tuning. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-6935-4_13

DOI: https://doi.org/10.1007/978-1-4419-6935-4_13
Published: 13 August 2010
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4419-6934-7
Online ISBN: 978-1-4419-6935-4
eBook Packages: Engineering (R0)
