
Automatic Thread Block Size Selection Strategy in GPU Parallel Code Generation

Conference paper in: Parallel Architectures, Algorithms and Programming (PAAP 2020)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1362)

Abstract

In order to improve the flexibility of thread configuration during automatic GPU parallel code generation, an automatic thread block size selection strategy based on multiprocessor occupancy is proposed, and an analysis method based on the polyhedral model is employed to analyze the array accesses of GPU kernels. The strategy evaluates the occupancy of device kernels under different thread block sizes and selects the block size with the highest occupancy. For kernels with complex array accesses or synchronization statements, it automatically selects a smaller thread block size to enlarge the scheduling space of thread warps. The strategy is implemented in the Polly-ACC module of the LLVM compilation framework and tested on a Tesla-architecture GPU using the PolyBench benchmark suite. Compared with fixed block sizes of 32 × 8 and 32 × 16, the generated code achieves average performance improvements of 9.7% and 15.5%, respectively; compared with the thread block size suggestions of the CUDA occupancy API, an average performance improvement of 21% is obtained.
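Although the strategy described above runs at compile time inside Polly-ACC, its core decision — scoring candidate block sizes by the multiprocessor occupancy they yield, and preferring smaller blocks to widen the warp-scheduling space — can be approximated at runtime with the CUDA occupancy API. The sketch below is an illustration, not the authors' implementation; the stencil kernel, the candidate list, and the tie-breaking rule are assumptions introduced for this example.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a generated GPU kernel (not from the paper).
__global__ void stencil1d(float *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i > 0 && i < n - 1)
    out[i] = 0.5f * (in[i - 1] + in[i + 1]);
}

// Return the candidate block size with the highest theoretical occupancy.
// Using strict '>' keeps the smaller candidate on ties, loosely mirroring
// the paper's preference for smaller blocks when that enlarges the warp
// scheduling space.
int selectBlockSize(const void *kernel) {
  const int candidates[] = {128, 256, 512, 1024};  // e.g. 32x4 .. 32x32 shapes
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);

  int best = candidates[0];
  double bestOcc = -1.0;
  for (int bs : candidates) {
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, bs,
                                                  /*dynamicSMemSize=*/0);
    // Occupancy = resident threads / maximum resident threads per SM.
    double occ = (double)(blocksPerSM * bs) / prop.maxThreadsPerMultiProcessor;
    if (occ > bestOcc) {
      bestOcc = occ;
      best = bs;
    }
  }
  return best;
}

int main() {
  printf("occupancy-based choice: %d threads/block\n",
         selectBlockSize((const void *)stencil1d));

  // For comparison: the CUDA occupancy API's own block size suggestion,
  // the baseline the paper reports outperforming by 21% on average.
  int minGridSize = 0, suggested = 0;
  cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggested, stencil1d);
  printf("cudaOccupancyMaxPotentialBlockSize: %d threads/block\n", suggested);
  return 0;
}

A compile-time code generator would evaluate the same occupancy formula from the kernel's register and shared-memory usage per block rather than querying the runtime, which is where the paper's approach differs from this runtime sketch.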




Author information

Correspondence to Pu Han.


Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Hu, W., Han, L., Han, P., Shang, J. (2021). Automatic Thread Block Size Selection Strategy in GPU Parallel Code Generation. In: Ning, L., Chau, V., Lau, F. (eds) Parallel Architectures, Algorithms and Programming. PAAP 2020. Communications in Computer and Information Science, vol 1362. Springer, Singapore. https://doi.org/10.1007/978-981-16-0010-4_34


  • DOI: https://doi.org/10.1007/978-981-16-0010-4_34

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-0009-8

  • Online ISBN: 978-981-16-0010-4

