Abstract
To improve the flexibility of thread configuration during automatic generation of GPU parallel code, an automatic thread block size selection strategy based on multiprocessor occupancy is proposed, together with a polyhedral-model-based method for analyzing the array accesses of GPU kernels. The strategy evaluates the occupancy of a device kernel under different thread block sizes and selects the block size that yields the highest occupancy. For kernels with complex array accesses or synchronization statements, it automatically chooses a smaller thread block size to enlarge the scheduling space of thread warps. The strategy is implemented in the Polly-ACC module of the LLVM compilation framework and evaluated on a Tesla-architecture GPU with the PolyBench benchmark suite. Compared with fixed block sizes of 32 × 8 and 32 × 16, the generated code achieves average performance improvements of 9.7% and 15.5%, respectively; compared with the thread block size suggested by the CUDA API, it achieves an average improvement of 21%.
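To make the occupancy criterion concrete, the sketch below shows how theoretical occupancy can be queried for a set of candidate block sizes with the CUDA runtime occupancy API. This is only an illustrative runtime sketch, not the paper's compile-time implementation inside Polly-ACC: the kernel, the candidate list, the selectBlockSize helper, and the tie-breaking toward smaller blocks are all assumptions made for the example.

#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel standing in for an automatically generated device kernel (illustrative only).
__global__ void exampleKernel(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * 2.0f;
}

// Sketch: estimate theoretical occupancy for each candidate block size and
// return the candidate with the highest occupancy. With the strict '>' test,
// the smaller of two equally occupying candidates wins, loosely echoing the
// paper's preference for smaller blocks when warp scheduling space matters.
int selectBlockSize(const void* kernel, const int* candidates, int numCandidates) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int best = candidates[0];
    double bestOcc = -1.0;
    for (int c = 0; c < numCandidates; ++c) {
        int blockSize = candidates[c];
        int activeBlocks = 0;
        // Active blocks per SM for this kernel at this block size (no dynamic shared memory).
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocks, kernel, blockSize, 0);
        double occ = (double)(activeBlocks * blockSize) / prop.maxThreadsPerMultiProcessor;
        if (occ > bestOcc) { bestOcc = occ; best = blockSize; }
    }
    return best;
}

int main() {
    // Candidate totals corresponding to 2-D shapes such as 32 x 4 ... 32 x 32 (illustrative values).
    int candidates[] = {128, 256, 512, 1024};
    int chosen = selectBlockSize((const void*)exampleKernel, candidates, 4);
    printf("selected block size: %d threads\n", chosen);
    return 0;
}

For comparison, the block size suggestion available directly in the CUDA API (the baseline the abstract refers to) can be obtained with cudaOccupancyMaxPotentialBlockSize, which returns a single recommended block size without considering per-kernel factors such as synchronization behavior.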