DOI: 10.1145/3497776.3517780
Research article · Open access

Performant portable OpenMP

Published: 18 March 2022

Abstract

Accelerated computing has increased the need to specialize how a program is parallelized for each target. Fully exploiting a highly parallel accelerator such as a GPU demands more parallelism, and sometimes more levels of parallelism, than a multicore CPU. OpenMP has a directive for each level of parallelism, but choosing directives separately for each target incurs a significant productivity cost. We argue that the new OpenMP loop directive, combined with an appropriate compiler decision process, can achieve the same performance as target-specific parallelization while offering the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. We provide an implementation of our proposal in NVIDIA's HPC compiler. On GPUs it yields speedups of up to 56x, with average speedups of 1.91x or 1.79x over the baseline depending on the host system, while preserving CPU performance. In addition, our proposal requires 60% fewer parallelism directives.


Cited By

  • Unleashing CPU Potential for Executing GPU Programs Through Compiler/Runtime Optimizations. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 186-200. DOI: 10.1109/MICRO61859.2024.00023. Published 2 Nov 2024.
  • Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution. Proceedings of the 52nd International Conference on Parallel Processing Workshops, 112-118. DOI: 10.1145/3605731.3606016. Published 7 Aug 2023.
  • OpenMP Offload Features and Strategies for High Performance across Architectures and Compilers. 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 564-573. DOI: 10.1109/IPDPSW59300.2023.00098. Published May 2023.
  • Exploring the Limits of Generic Code Execution on GPUs via Direct (OpenMP) Offload. OpenMP: Advanced Task-Based, Device and Compiler Programming, 179-192. DOI: 10.1007/978-3-031-40744-4_12. Published 13 Sep 2023.
  • Implementing a GPU-Portable Field Line Tracing Application with OpenMP Offload. High Performance Computing, 31-46. DOI: 10.1007/978-3-031-23821-5_3. Published 21 Dec 2022.

Published In

CC 2022: Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction
March 2022
253 pages
ISBN:9781450391832
DOI:10.1145/3497776

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Compilers
  2. GPUs
  3. OpenMP
  4. Parallel Programming Languages


Article Metrics

  • Downloads (last 12 months): 189
  • Downloads (last 6 weeks): 20

Reflects downloads up to 24 Jan 2025.
