DOI: 10.1145/3582514.3582519
research-article

MPI-based Remote OpenMP Offloading: A More Efficient and Easy-to-use Implementation

Published: 25 February 2023

Abstract

MPI+X is the most popular hybrid programming model for distributed computation on modern heterogeneous HPC systems. Nonetheless, for simplicity, HPC developers would ideally like to implement multi-node distributed parallel computing through a single coherent programming model. As the de facto standard for parallel programming, OpenMP has been one of the most influential programming models in parallel computing. Recent work has shown that the OpenMP target offloading model can be used to program distributed accelerator-based HPC systems with marginal changes to the application. However, the UCX-based version of remote OpenMP offloading still has many limitations in terms of performance overhead and the plugin's ease of use.
In this work, we implement a new MPI-based remote OpenMP offloading plugin. Compared with the UCX-based version, the new MPI-based plugin is significantly improved in terms of performance, scalability, and ease of use. We evaluate our work using one proxy app, XSBench, and an industrial-level seismic modeling code, Minimod. Results show that, compared to the optimized UCX-based version, our optimizations can reduce offloading latency by up to 70% and raise application parallel efficiency by 68% when running with 16 GPUs on data-bound applications. In particular, the concept of locality-aware offloading gives developers of HPC programs greater possibilities to take full advantage of modern hierarchical heterogeneous computing devices.
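
To make the "marginal changes to the application" claim concrete, the following is a minimal sketch of the standard OpenMP target offloading pattern the abstract refers to. It is not code from the paper or from the plugin. The function names `saxpy_on_device` and `saxpy_multi_device` are hypothetical, and the sketch assumes that a remote-offloading runtime exposes remote accelerators as additional ordinary OpenMP device numbers, so that the application only selects a device with the `device` clause.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes y[i] = a * x[i] + y[i] on one OpenMP device.
 * device_id is an ordinary OpenMP device number; this sketch ASSUMES a
 * remote-offloading runtime would list remote GPUs as extra device numbers. */
void saxpy_on_device(int device_id, float a, const float *x, float *y, size_t n)
{
    /* map(to:) copies x to the device; map(tofrom:) copies y in and back out. */
    #pragma omp target teams distribute parallel for device(device_id) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Hypothetical driver: splits the arrays into contiguous chunks and offloads
 * one chunk per device. num_devices would normally come from
 * omp_get_num_devices(); it is a parameter here to keep the sketch simple. */
void saxpy_multi_device(int num_devices, float a, const float *x, float *y, size_t n)
{
    assert(num_devices > 0);
    size_t devices = (size_t)num_devices;
    size_t chunk = (n + devices - 1) / devices;  /* ceiling division */

    for (size_t d = 0; d < devices; ++d) {
        size_t begin = d * chunk;
        if (begin >= n) {
            break;  /* more devices than chunks of work */
        }
        size_t len = (n - begin < chunk) ? (n - begin) : chunk;
        /* Each target region is synchronous, so chunks run one after another. */
        saxpy_on_device((int)d, a, x + begin, y + begin, len);
    }
}
```

Calling `saxpy_multi_device(3, 2.0f, x, y, 10)` processes elements 0 to 3, 4 to 7, and 8 to 9 as three separate target regions. When the file is compiled without OpenMP support the pragmas are ignored and the loops run on the host, which gives the same numerical result. Overlapping the chunks (for example with `nowait` and a later `taskwait`) is a separate design choice that this sketch leaves out.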

    Published In

    PMAM'23: Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores
    February 2023
    73 pages
    ISBN:9798400701153
    DOI:10.1145/3582514
    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. OpenMP
    2. GPGPU
    3. distributed computing

    Qualifiers

    • Research-article

    Conference

    PMAM'23

    Acceptance Rates

    Overall Acceptance Rate 53 of 97 submissions, 55%
