Abstract
Abstract
OpenMP has supported target offloading since version 4.0, and LLVM/Clang supports its compilation and optimization. Several optimizing transformations in LLVM aim to improve the performance of offloaded regions, especially when targeting GPUs. Although efficient memory usage is essential for high performance on a GPU, little work has been done to automatically optimize memory transactions inside the target region at compile time.
In this work, we develop an inter-procedural LLVM transformation that improves the performance of OpenMP target regions by optimizing memory transactions. The pass prefetches some of the read-only input data into fast shared memory via compile-time code injection. When data is reused, accesses to shared memory far outpace global memory accesses; consequently, our method can significantly improve performance when the right data is placed in shared memory.
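As a rough illustration of the transformation described above (a host-side sketch, not the actual output of the pass; the function names and the stack array standing in for GPU shared memory are illustrative assumptions), the rewrite amounts to staging a reused read-only table into fast memory before the compute loop:

```c
#include <string.h>

#define TABLE_SIZE 16

/* Untransformed pattern: every lookup reads the (slow, "global") table. */
float lookup_global(const float *table, const int *idx, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += table[idx[i] % TABLE_SIZE];
    return sum;
}

/* Transformed pattern: a copy of the read-only table is first staged
 * into a fast buffer (shared memory on a GPU; a stack array in this
 * host-side model), so every reused access afterwards hits fast memory. */
float lookup_prefetched(const float *table, const int *idx, int n) {
    float fast[TABLE_SIZE];               /* stands in for shared memory  */
    memcpy(fast, table, sizeof fast);     /* the injected prefetch copy   */
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += fast[idx[i] % TABLE_SIZE]; /* access redirected to fast[]  */
    return sum;
}
```

Both functions compute the same result; the payoff of the second form comes only on hardware where the staged buffer is genuinely faster and the data is reused, which is exactly the case the pass targets.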
Notes
- 1.
It is better to use the default option for cases where some (but not all) of the team’s chunk iterations read the same locations. Avoiding the prefetching of redundant data in these cases complicates the copy_to_shared_mem function in several ways (e.g., it adds conditional branches), which degrades performance.
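To make the trade-off in this note concrete, here is a small host-side C sketch (hypothetical; the function names, the `is_read` guard map, and the tid/nthreads modeling of a thread team are not from the paper) contrasting the branch-free default copy with a guarded copy that skips locations the team never reads:

```c
#define TABLE_SIZE 16

/* Default option: a straight-line, branch-free cooperative copy. Each
 * "thread" (modeled here by a tid/nthreads stride) copies its slice
 * verbatim, even if some locations are read by several iterations. */
void copy_unconditional(float *fast, const float *table,
                        int tid, int nthreads) {
    for (int i = tid; i < TABLE_SIZE; i += nthreads)
        fast[i] = table[i];
}

/* Avoiding redundant prefetches requires per-element guards (here a
 * hypothetical is_read[] map of the locations the team's chunk
 * touches). The extra conditional branches are exactly what the note
 * warns can degrade the performance of the copy routine on a GPU. */
void copy_guarded(float *fast, const float *table,
                  const int *is_read, int tid, int nthreads) {
    for (int i = tid; i < TABLE_SIZE; i += nthreads)
        if (is_read[i])
            fast[i] = table[i];
}
```

On a GPU the guarded variant can additionally cause divergence within a warp when neighboring threads take different branch outcomes, which is one way the conditional copy ends up slower than simply copying everything.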
Acknowledgements
The first and second authors thank NSERC of Canada (Grant RGPIN-2018-06534) for its support. Part of this research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (the Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. Part of this research was supported by the Lawrence Livermore National Security, LLC (“LLNS”) via MPO No. B642066.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Talaashrafi, D., Maza, M.M., Doerfert, J. (2022). Towards Automatic OpenMP-Aware Utilization of Fast GPU Memory. In: Klemm, M., de Supinski, B.R., Klinkenberg, J., Neth, B. (eds) OpenMP in a Modern World: From Multi-device Support to Meta Programming. IWOMP 2022. Lecture Notes in Computer Science, vol 13527. Springer, Cham. https://doi.org/10.1007/978-3-031-15922-0_5
Print ISBN: 978-3-031-15921-3
Online ISBN: 978-3-031-15922-0