Research Article
DOI: 10.1145/3635035.3635036 (HPCAsia Conference Proceedings)

Non-Blocking GPU-CPU Notifications to Enable More GPU-CPU Parallelism

Published: 19 January 2024

ABSTRACT

GPUs are increasingly popular in HPC systems, and more applications are adopting GPUs each day. However, the control synchronization of GPUs with CPUs is suboptimal and only possible after GPU kernel termination points, resulting in serialized host and device tasks. In this paper, we propose a novel CPU-GPU notification method that enables non-blocking in-kernel control synchronization of device and host tasks in combination with persistent GPU kernels. Using this notification method, we increase the overlap of CPU and GPU execution, and thereby the overall parallelism. We present the concept and structure of the proposed notification mechanism together with in-kernel GPU-CPU control synchronization, using halo exchange as an example. We analyze the performance of the halo-exchange pattern using our new notification method, as well as the interference between CPU and GPU operations due to the execution overlap. Finally, we verify our results using a performance model covering the halo-exchange pattern with the new notification method.


Published in

HPCAsia '24: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
January 2024, 185 pages
ISBN: 9798400708893
DOI: 10.1145/3635035

          Copyright © 2024 ACM

          Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

          Publisher

          Association for Computing Machinery

          New York, NY, United States


Acceptance Rates

Overall acceptance rate: 69 of 143 submissions (48%)
