Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading

Tian, Shilei; Huber, Joseph; Tramm, John; Chapman, Barbara; Doerfert, Johannes

doi:10.1007/978-3-031-15922-0_10

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13527)

Included in the following conference series: International Workshop on OpenMP

Abstract

Following the mass adoption of external accelerators for high performance computing, the overall performance of many applications has become increasingly dependent on relatively small accelerated kernels. As static analysis is fundamentally limited by dynamic values and external definitions, standard ahead-of-time compilation is not always sufficient to achieve the best performance. Furthermore, many users looking to port an existing application to run on an external accelerator will not want to fundamentally restructure their programs. These and other problems can be addressed through both link-time optimization (LTO) and just-in-time (JIT) compilation, but until now both have had sparse and inconsistent compiler support.
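As a minimal illustration of the limitation described above, the following sketch shows an OpenMP target region whose trip count and stride are known only at run time. The function `scale_strided` and its parameters are hypothetical and do not come from the paper. An ahead-of-time compiler has to emit generic code for this loop, whereas a compiler that runs later could in principle see the concrete values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: `stride` and the vector length are run-time values,
// so static analysis cannot specialize the loop for them.
void scale_strided(std::vector<double> &data, std::size_t stride, double factor) {
  assert(stride > 0 && "stride must be positive");
  double *ptr = data.data();
  const std::size_t n = data.size();
  // If offloading is not enabled, the pragma is ignored (or the region
  // falls back to the host) and the loop runs as ordinary host code.
  #pragma omp target teams distribute parallel for map(tofrom: ptr[0:n])
  for (std::size_t i = 0; i < n; i += stride)
    ptr[i] *= factor;
}
```

Calling `scale_strided(v, 2, 10.0)` multiplies every second element of `v` by ten. Whether a given tool-chain actually specializes on such values is an implementation detail that this sketch does not cover.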

In this work, we present a new compilation method that enables device-side LTO as well as a transparent JIT compilation tool-chain for OpenMP target offloading. Our contributions include an entirely new device linking and embedding scheme to enable LTO as well as a novel JIT engine to efficiently optimize OpenMP offloading regions at run-time. We also introduce a persistent caching system to improve end-to-end runtime using the JIT engine and minimize kernel launching overheads. We measure the performance of our LTO and JIT implementation via several real-world scientific applications. With our optimizations we observe significant improvements through LTO on large applications as well as significant end-to-end execution time improvement using JIT.
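The idea of a persistent cache for JIT-compiled kernels can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the class `KernelCache`, the key made of device IR plus target architecture, the on-disk layout, and the `compile` callback standing in for the JIT step are all hypothetical.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <optional>
#include <string>
#include <utility>

namespace fs = std::filesystem;

// FNV-1a hash: chosen because its result is stable across program runs,
// which std::hash does not guarantee.
inline std::uint64_t fnv1a(const std::string &bytes,
                           std::uint64_t h = 14695981039346656037ULL) {
  for (unsigned char c : bytes) {
    h ^= c;
    h *= 1099511628211ULL;
  }
  return h;
}

// Hypothetical persistent cache: one file per (device IR, architecture) key.
// Hash collisions are ignored in this sketch.
class KernelCache {
public:
  explicit KernelCache(fs::path dir) : dir_(std::move(dir)) {
    fs::create_directories(dir_);
  }

  // Return the cached image; on a miss, run `compile` and store the result.
  std::string getOrCompile(
      const std::string &ir, const std::string &arch,
      const std::function<std::string(const std::string &)> &compile) {
    const fs::path file =
        dir_ / (std::to_string(fnv1a(arch, fnv1a(ir))) + ".bin");
    if (auto hit = read(file))
      return *hit;
    std::string image = compile(ir);
    std::ofstream out(file, std::ios::binary);
    out << image;
    return image;
  }

private:
  static std::optional<std::string> read(const fs::path &file) {
    std::ifstream in(file, std::ios::binary);
    if (!in)
      return std::nullopt;
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
  }

  fs::path dir_;
};
```

Because the cache lives on disk, a second process that requests the same IR for the same architecture skips the `compile` callback entirely, which is where the end-to-end saving of such a scheme would come from.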

Notes

1. Technically, this need not be limited to JIT time; LTO time is sufficient.

Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. We also gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The publisher acknowledges the US government license to provide public access under the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan).

Author information

Authors and Affiliations

Stony Brook University, Stony Brook, USA
Shilei Tian & Barbara Chapman
Oak Ridge National Laboratory, Oak Ridge, USA
Joseph Huber
Argonne National Laboratory, Lemont, USA
John Tramm & Johannes Doerfert

Corresponding authors

Correspondence to Shilei Tian or Johannes Doerfert.

Editor information

Editors and Affiliations

OpenMP ARB, Beaverton, OR, USA
Michael Klemm
Lawrence Livermore National Laboratory, Livermore, CA, USA
Bronis R. de Supinski
RWTH Aachen University, Aachen, Germany
Jannis Klinkenberg
University of Arizona, Tucson, AZ, USA
Brandon Neth

About this paper

Cite this paper

Tian, S., Huber, J., Tramm, J., Chapman, B., Doerfert, J. (2022). Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading. In: Klemm, M., de Supinski, B.R., Klinkenberg, J., Neth, B. (eds) OpenMP in a Modern World: From Multi-device Support to Meta Programming. IWOMP 2022. Lecture Notes in Computer Science, vol 13527. Springer, Cham. https://doi.org/10.1007/978-3-031-15922-0_10

DOI: https://doi.org/10.1007/978-3-031-15922-0_10
Published: 20 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15921-3
Online ISBN: 978-3-031-15922-0
eBook Packages: Computer Science, Computer Science (R0)
