Abstract
To improve the flexibility of thread configuration during automatic generation of GPU parallel code, an automatic thread block size selection strategy based on multiprocessor occupancy is proposed, together with a polyhedral-model-based method for analyzing the array accesses of GPU kernels. The strategy evaluates the occupancy of a device kernel under different thread block sizes and selects the block size that yields the highest occupancy. For kernels with complex array accesses or synchronization statements, it automatically chooses a smaller thread block size to enlarge the scheduling space of thread warps. The strategy is implemented in the Polly-ACC module of the LLVM compilation framework and evaluated on a Tesla-architecture GPU with the PolyBench benchmark suite. Compared with fixed block sizes of 32 × 8 and 32 × 16, the generated code achieves average performance improvements of 9.7% and 15.5%, respectively; compared with the thread block size suggested by the CUDA API, it achieves an average improvement of 21%.
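To make the occupancy criterion concrete, the sketch below shows how theoretical occupancy can be queried for a set of candidate block sizes with the CUDA runtime occupancy API. This is only an illustrative runtime sketch, not the paper's compile-time implementation inside Polly-ACC: the kernel, the candidate list, the selectBlockSize helper, and the tie-breaking toward smaller blocks are all assumptions made for the example.

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel standing in for an automatically generated device kernel (illustrative only).
__global__ void exampleKernel(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f;
}

// Sketch: estimate theoretical occupancy for each candidate block size and
// return the candidate with the highest occupancy. With the strict '>' test,
// the smaller of two equally occupying candidates wins, loosely echoing the
// paper's preference for smaller blocks when warp scheduling space matters.
int selectBlockSize(const void* kernel, const int* candidates, int numCandidates) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int best = candidates[0];
    double bestOcc = -1.0;
    for (int c = 0; c < numCandidates; ++c) {
        int blockSize = candidates[c];
        int activeBlocks = 0;
        // Active blocks per SM for this kernel at this block size (no dynamic shared memory).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocks, kernel, blockSize, 0);
        double occ = (double)(activeBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
        if (occ > bestOcc) { bestOcc = occ; best = blockSize; }
    }
    return best;
}

int main() {
    // Candidate totals corresponding to 2-D shapes such as 32 x 4 ... 32 x 32 (illustrative values).
    int candidates[] = {128, 256, 512, 1024};
    int chosen = selectBlockSize((const void*)exampleKernel, candidates, 4);
    printf("selected block size: %d threads\n", chosen);
    return 0;
}

For comparison, the block size suggestion available directly in the CUDA API (the baseline the abstract refers to) can be obtained with cudaOccupancyMaxPotentialBlockSize, which returns a single recommended block size without considering per-kernel factors such as synchronization behavior.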