MPI-based Remote OpenMP Offloading: A More Efficient and Easy-to-use Implementation

Authors:

Mauricio Araya-Polo,

Barbara Chapman

PMAM'23: Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores

Pages 50 - 59

https://doi.org/10.1145/3582514.3582519

Published: 25 February 2023

Abstract

MPI+X is the most popular hybrid programming model for distributed computation on modern heterogeneous HPC systems. For simplicity, however, HPC developers would ideally implement multi-node distributed parallel computing with a single coherent programming model. As the de facto standard for parallel programming, OpenMP has been one of the most influential programming models in parallel computing. Recent work has shown that the OpenMP target offloading model can be used to program distributed accelerator-based HPC systems with only marginal changes to the application. However, the UCX-based version of remote OpenMP offloading still has significant limitations in performance overhead and in the ease of use of the plugin.

In this work, we implement a new MPI-based remote OpenMP offloading plugin. Compared with the UCX-based version, the new MPI-based plugin is significantly improved in performance, scalability, and ease of use. We evaluate our work using a proxy app, XSBench, and an industrial-level seismic modeling code, Minimod. Results show that, compared to the optimized UCX-based version, our optimizations reduce offloading latency by up to 70% and raise application parallel efficiency by 68% when running data-bound applications on 16 GPUs. In particular, the introduction of locality-aware offloading gives HPC developers greater scope to take full advantage of modern hierarchical heterogeneous computing devices.
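The two headline metrics in the abstract, offloading latency reduction and parallel efficiency, can be made concrete with a short calculation. The sketch below assumes the conventional strong-scaling definition of parallel efficiency (speedup divided by device count) and a relative latency reduction; this page does not state the paper's exact definitions or whether the 68% gain is relative or in percentage points. The function names and all timing values are hypothetical and are not results from the paper.

```python
def parallel_efficiency(t_one: float, t_n: float, n_devices: int) -> float:
    """Strong-scaling efficiency: (t_one / t_n) / n_devices.

    Assumed definition; the paper's own definition is not given on this page.
    """
    if n_devices < 1 or t_one <= 0 or t_n <= 0:
        raise ValueError("timings must be positive and n_devices >= 1")
    return (t_one / t_n) / n_devices


def latency_reduction(baseline: float, optimized: float) -> float:
    """Relative reduction of the optimized latency against a baseline."""
    if baseline <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline - optimized) / baseline


# Hypothetical timings, chosen only to illustrate the arithmetic.
t_one_gpu = 160.0   # seconds on 1 GPU (made up)
t_16_gpus = 12.5    # seconds on 16 GPUs (made up)
efficiency = parallel_efficiency(t_one_gpu, t_16_gpus, 16)

base_latency = 10.0  # microseconds, baseline plugin (made up)
new_latency = 3.0    # microseconds, optimized plugin (made up)
reduction = latency_reduction(base_latency, new_latency)

print(f"parallel efficiency on 16 devices: {efficiency:.2f}")
print(f"offloading latency reduction: {reduction:.0%}")
```

Under these definitions, an efficiency of 1.0 means perfect scaling, and a latency reduction of 0.7 corresponds to the "up to 70%" phrasing used above.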

Index Terms

1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages

Published In

PMAM'23: Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores

February 2023

73 pages

ISBN:9798400701153

DOI:10.1145/3582514

Program Co-chairs:
Quan Chen,
Zhiyi Huang,
Min Si

Copyright © 2023 ACM.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PMAM'23

PMAM'23: 14th International Workshop on Programming Models and Applications for Multicores and Manycores

February 25 - March 1, 2023

Montreal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 53 of 97 submissions, 55%
