ABSTRACT
Accelerated computing has increased the need to specialize how a program is parallelized depending on the target. Fully exploiting a highly parallel accelerator, such as a GPU, demands more parallelism, and sometimes more levels of parallelism, than a multicore CPU. OpenMP has a directive for each level of parallelism, but choosing directives for each target can incur a significant productivity cost. We argue that using the new OpenMP loop directive with an appropriate compiler decision process can achieve the same performance benefits as target-specific parallelization, with the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. We provide an implementation of our proposal in NVIDIA's HPC compiler. On GPUs it yields up to a 56x speedup and an average speedup of 1.79x to 1.91x over the baseline (depending on the host system), while preserving CPU performance. In addition, our proposal requires 60% fewer parallelism directives.
Index Terms
- Performant portable OpenMP