
Concurrent Execution of Deferred OpenMP Target Tasks with Hidden Helper Threads

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 13149))

Abstract

In this paper, we introduce a novel approach to support concurrent offloading for OpenMP tasks based on hidden helper threads. We contrast our design with alternative implementations and explain why the approach we have chosen provides the most consistent performance across a wide range of use cases. In addition to a theoretical discussion of the trade-offs, we detail our implementation in the LLVM compiler infrastructure. Finally, we provide evaluation results for four extreme offloading situations on the Summit supercomputer, showing that we achieve a speedup of up to 6.7× over synchronous offloading and performance comparable to the commercial IBM XL C/C++ compiler.


Notes

  1. The fallback case, execution on the issuing device, is sufficiently similar.

  2. This is CUDA terminology, but almost all heterogeneous programming models have a similar concept, such as the command queue in OpenCL.



Acknowledgments

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

Author information


Corresponding author

Correspondence to Shilei Tian.



Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Tian, S., Doerfert, J., Chapman, B. (2022). Concurrent Execution of Deferred OpenMP Target Tasks with Hidden Helper Threads. In: Chapman, B., Moreira, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2020. Lecture Notes in Computer Science(), vol 13149. Springer, Cham. https://doi.org/10.1007/978-3-030-95953-1_4



  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-95952-4

  • Online ISBN: 978-3-030-95953-1

  • eBook Packages: Computer Science (R0)
