ABSTRACT
Accelerated computing has increased the need to specialize how a program is parallelized depending on the target. Fully exploiting a highly parallel accelerator, such as a GPU, demands more parallelism, and sometimes more levels of parallelism, than a multicore CPU. OpenMP has a directive for each level of parallelism, but choosing directives for each target can incur a significant productivity cost. We argue that using the new OpenMP loop directive with an appropriate compiler decision process can achieve the same performance benefits as target-specific parallelization, with the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. We provide an implementation of our proposal in NVIDIA's HPC compiler. On GPUs it yields up to a 56x speedup and an average speedup of 1.79x to 1.91x over the baseline (depending on the host system), while preserving CPU performance. In addition, our proposal requires 60% fewer parallelism directives.
Index Terms
- Performant portable OpenMP