DOI: 10.1145/3497776.3517780
Research article, Open Access

Performant portable OpenMP

Published: 18 March 2022

ABSTRACT

Accelerated computing has increased the need to specialize how a program is parallelized for each target. Fully exploiting a highly parallel accelerator, such as a GPU, demands more parallelism, and sometimes more levels of parallelism, than a multicore CPU. OpenMP has a directive for each level of parallelism, but choosing directives for each target can incur a significant productivity cost. We argue that the new OpenMP loop directive, combined with an appropriate compiler decision process, can achieve the same performance benefits as target-specific parallelization while retaining the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive in NVIDIA's HPC compiler, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. On GPUs, our implementation yields up to 56x speedup and an average speedup of 1.79x to 1.91x over the baseline (depending on the host system), and it preserves CPU performance. In addition, our proposal requires 60% fewer parallelism directives.
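
To make the contrast in the abstract concrete, the following is a minimal sketch, not code from the paper. It writes the same SAXPY kernel twice: once with a prescriptive combined construct that names each level of parallelism, and once with the descriptive `loop` directive that leaves the mapping to the compiler. The function names, the kernel, and the `map` clauses are illustrative assumptions; how a given compiler actually maps `loop` to hardware is implementation-specific.

```c
#include <assert.h>
#include <stddef.h>

/* Prescriptive variant (illustrative): the programmer spells out the
 * levels of parallelism (teams, distribute, parallel for). */
void saxpy_prescriptive(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Descriptive variant (illustrative): the loop directive only states that
 * the iterations may run concurrently; the compiler chooses the mapping. */
void saxpy_descriptive(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma omp target teams loop \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical self-check: runs both variants on the same input and
 * returns the number of elements that differ from the expected value. */
int check_saxpy_variants(void)
{
    enum { N = 1024 };
    float x[N], y1[N], y2[N];

    for (int i = 0; i < N; ++i) {
        x[i] = (float)i;   /* small integers are exact in float */
        y1[i] = 1.0f;
        y2[i] = 1.0f;
    }

    saxpy_prescriptive(N, 2.0f, x, y1);
    saxpy_descriptive(N, 2.0f, x, y2);

    int mismatches = 0;
    for (int i = 0; i < N; ++i) {
        float expected = 2.0f * (float)i + 1.0f;
        if (y1[i] != expected || y2[i] != expected)
            ++mismatches;
    }
    return mismatches;
}
```

Calling `check_saxpy_variants()` should return 0 when both variants compute the same result. If OpenMP is not enabled at compile time, the pragmas are ignored and both functions run serially on the host, so the sketch shows only the difference in source form, not the performance results reported above.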

Published in

CC 2022: Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction
March 2022
253 pages
ISBN: 9781450391832
DOI: 10.1145/3497776

Copyright © 2022 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
