ABSTRACT
GPUs are increasingly popular in HPC systems, and more applications adopt GPUs every day. However, control synchronization between GPU and CPU is suboptimal: it is only possible at GPU kernel termination points, which serializes host and device tasks. In this paper, we propose a novel CPU-GPU notification method that, in combination with persistent GPU kernels, enables non-blocking in-kernel control synchronization between device and host tasks. Using this notification method, we increase the overlap of CPU and GPU execution and thereby the available parallelism. We present the concept and structure of the proposed notification mechanism together with in-kernel GPU-CPU control synchronization, using halo exchange as an example. We analyze the performance of the halo-exchange pattern with our new notification method, as well as the interference between CPU and GPU operations caused by the execution overlap. Finally, we validate our results against a performance model covering the halo-exchange pattern with the new notification method.
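The abstract does not include code, but the mechanism it describes can be illustrated with a minimal sketch: a persistent kernel that signals the host through a flag in host-pinned, device-mapped memory and then resumes without terminating, so the CPU (e.g., to drive a halo exchange) can react while the kernel stays resident. This is not the paper's implementation; the names (`persistent_kernel`, `notify`, `resume`) and the single-block, flag-polling structure are illustrative assumptions, and a real persistent kernel would use grid-wide synchronization (e.g., cooperative groups) before signaling.

```cuda
// Sketch: non-blocking GPU->CPU notification via host-pinned mapped flags.
// Assumes unified virtual addressing, so pinned host pointers are usable
// directly on the device.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void persistent_kernel(volatile int *notify, volatile int *resume,
                                  float *halo, int iters) {
    // Single-block sketch; real persistent kernels need grid-wide sync.
    for (int iter = 0; iter < iters; ++iter) {
        // ... compute on the halo region here ...
        if (threadIdx.x == 0) {
            __threadfence_system();      // make results visible to the host
            notify[0] = iter + 1;        // signal: iteration's halo is ready
            while (resume[0] <= iter) {} // wait for the host to acknowledge
        }
        __syncthreads();                 // block continues; kernel never exits
    }
}

int main() {
    const int iters = 4;
    // Allow host-mapped allocations (implicit with UVA; kept for clarity).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    volatile int *notify, *resume;
    float *halo;
    cudaHostAlloc((void **)&notify, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&resume, sizeof(int), cudaHostAllocMapped);
    cudaMalloc((void **)&halo, 1024 * sizeof(float));
    *notify = 0;
    *resume = 0;

    persistent_kernel<<<1, 128>>>((int *)notify, (int *)resume, halo, iters);

    for (int iter = 1; iter <= iters; ++iter) {
        while (*notify < iter) {}  // poll; the kernel keeps running meanwhile
        // ... CPU-side work (e.g., MPI halo exchange) would go here ...
        printf("host: received notification %d\n", iter);
        *resume = iter;            // let the kernel start the next iteration
    }
    cudaDeviceSynchronize();
    cudaFree(halo);
    cudaFreeHost((void *)notify);
    cudaFreeHost((void *)resume);
    return 0;
}
```

The design point this sketch captures is that the host learns about intra-kernel progress through memory flags rather than through kernel termination (e.g., `cudaDeviceSynchronize`), which is what allows CPU and GPU tasks to overlap instead of serializing at kernel boundaries.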