Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading

  • Conference paper
OpenMP in a Modern World: From Multi-device Support to Meta Programming (IWOMP 2022)

Abstract

Following the mass adoption of external accelerators for high-performance computing, the overall performance of many applications has become increasingly dependent on relatively small accelerated kernels. Because static analysis is fundamentally limited by dynamic values and external definitions, standard ahead-of-time compilation is not always sufficient to achieve the best performance. Furthermore, many users looking to port an existing application to an external accelerator will not want to fundamentally restructure their programs. These and other problems can be addressed through link-time optimization (LTO) and just-in-time (JIT) compilation, but until now both have had only sparse and inconsistent compiler support.

In this work, we present a new compilation method that enables device-side LTO as well as a transparent JIT compilation tool-chain for OpenMP target offloading. Our contributions include an entirely new device linking and embedding scheme to enable LTO as well as a novel JIT engine to efficiently optimize OpenMP offloading regions at run-time. We also introduce a persistent caching system to improve end-to-end runtime using the JIT engine and minimize kernel launching overheads. We measure the performance of our LTO and JIT implementation via several real-world scientific applications. With our optimizations we observe significant improvements through LTO on large applications as well as significant end-to-end execution time improvement using JIT.
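The idea behind a persistent cache for JIT-compiled device code can be illustrated with a minimal sketch. It is not the implementation described in the paper: the on-disk layout, the choice of cache key (device IR, target architecture, optimization flags), and the names `cache_key`, `get_or_compile`, and `fake_compile` are all assumptions made for illustration, and the compile step is a stand-in rather than a real compiler invocation.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical signature for a JIT backend: (device IR, target arch, flags) -> binary image.
CompileFn = Callable[[bytes, str, str], bytes]


def cache_key(device_ir: bytes, target_arch: str, opt_flags: str) -> str:
    """Hash every input that can change the generated device binary."""
    h = hashlib.sha256()
    for part in (device_ir, target_arch.encode(), opt_flags.encode()):
        # A length prefix keeps ("ab", "c") and ("a", "bc") from colliding.
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.hexdigest()


def get_or_compile(cache_dir: Path, device_ir: bytes, target_arch: str,
                   opt_flags: str, compile_fn: CompileFn) -> bytes:
    """Return a cached image if one exists; otherwise compile and store it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / (cache_key(device_ir, target_arch, opt_flags) + ".bin")
    if entry.exists():
        return entry.read_bytes()  # cache hit: skip JIT compilation entirely
    image = compile_fn(device_ir, target_arch, opt_flags)
    tmp = entry.with_suffix(".tmp")
    tmp.write_bytes(image)
    tmp.replace(entry)  # rename into place so readers never see a partial file
    return image


def demo() -> int:
    """Run three lookups and return how many times the compiler was invoked."""
    compile_calls = 0

    def fake_compile(ir: bytes, arch: str, flags: str) -> bytes:
        # Stand-in for a real JIT backend (assumption for illustration only).
        nonlocal compile_calls
        compile_calls += 1
        return b"BIN:" + arch.encode() + b":" + ir

    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "kernel-cache"
        first = get_or_compile(cache, b"kernel-ir", "sm_80", "-O3", fake_compile)
        again = get_or_compile(cache, b"kernel-ir", "sm_80", "-O3", fake_compile)
        assert first == again  # second lookup is served from disk
        get_or_compile(cache, b"kernel-ir", "gfx90a", "-O3", fake_compile)
    return compile_calls
```

In this sketch the second lookup with identical inputs is served from disk, while changing the target architecture produces a different key and triggers a fresh compilation. Because the cache lives on disk, a later run of the same application could in principle reuse it, which is the kind of end-to-end saving a persistent cache aims for.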

Notes

  1. Technically, this does not have to be limited to JIT time; LTO time is sufficient.

Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. We also gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The publisher acknowledges the US government license to provide public access under the DOE Public Access Plan (https://energy.gov/downloads/doe-public-access-plan).

Author information

Corresponding authors

Correspondence to Shilei Tian or Johannes Doerfert.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tian, S., Huber, J., Tramm, J., Chapman, B., Doerfert, J. (2022). Just-in-Time Compilation and Link-Time Optimization for OpenMP Target Offloading. In: Klemm, M., de Supinski, B.R., Klinkenberg, J., Neth, B. (eds) OpenMP in a Modern World: From Multi-device Support to Meta Programming. IWOMP 2022. Lecture Notes in Computer Science, vol 13527. Springer, Cham. https://doi.org/10.1007/978-3-031-15922-0_10

  • DOI: https://doi.org/10.1007/978-3-031-15922-0_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-15921-3

  • Online ISBN: 978-3-031-15922-0

  • eBook Packages: Computer Science, Computer Science (R0)