ABSTRACT
GPUs are increasingly popular in HPC systems, and more applications adopt GPUs every day. However, control synchronization between GPU and CPU is suboptimal: it is only possible at GPU kernel termination points, which serializes host and device tasks. In this paper, we propose a novel CPU-GPU notification method that, in combination with persistent GPU kernels, enables non-blocking in-kernel control synchronization between device and host tasks. Using this notification method, we increase the overlap of CPU and GPU execution and thereby the available parallelism. We present the concept and structure of the proposed notification mechanism together with in-kernel GPU-CPU control synchronization, using halo exchange as an example. We analyze the performance of the halo-exchange pattern with our new notification method, as well as the interference between CPU and GPU operations caused by the execution overlap. Finally, we validate our results against a performance model covering the halo-exchange pattern with the new notification method.
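The abstract does not include code, but the mechanism it describes can be illustrated with a minimal sketch: a persistent kernel that signals the host through a flag in host-pinned, device-mapped memory and then resumes without terminating, so the CPU (e.g., to drive a halo exchange) can react while the kernel stays resident. This is not the paper's implementation; the names (`persistent_kernel`, `notify`, `resume`) and the single-block, flag-polling structure are illustrative assumptions, and a real persistent kernel would use grid-wide synchronization (e.g., cooperative groups) before signaling.

```cuda
// Sketch: non-blocking GPU->CPU notification via host-pinned mapped flags.
// Assumes unified virtual addressing, so pinned host pointers are usable
// directly on the device.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void persistent_kernel(volatile int *notify, volatile int *resume,
                                  float *halo, int iters) {
    // Single-block sketch; real persistent kernels need grid-wide sync.
    for (int iter = 0; iter < iters; ++iter) {
        // ... compute on the halo region here ...
        if (threadIdx.x == 0) {
            __threadfence_system();      // make results visible to the host
            notify[0] = iter + 1;        // signal: iteration's halo is ready
            while (resume[0] <= iter) {} // wait for the host to acknowledge
        }
        __syncthreads();                 // block continues; kernel never exits
    }
}

int main() {
    const int iters = 4;
    // Allow host-mapped allocations (implicit with UVA; kept for clarity).
    cudaSetDeviceFlags(cudaDeviceMapHost);

    volatile int *notify, *resume;
    float *halo;
    cudaHostAlloc((void **)&notify, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&resume, sizeof(int), cudaHostAllocMapped);
    cudaMalloc((void **)&halo, 1024 * sizeof(float));
    *notify = 0;
    *resume = 0;

    persistent_kernel<<<1, 128>>>((int *)notify, (int *)resume, halo, iters);

    for (int iter = 1; iter <= iters; ++iter) {
        while (*notify < iter) {}  // poll; the kernel keeps running meanwhile
        // ... CPU-side work (e.g., MPI halo exchange) would go here ...
        printf("host: received notification %d\n", iter);
        *resume = iter;            // let the kernel start the next iteration
    }
    cudaDeviceSynchronize();
    cudaFree(halo);
    cudaFreeHost((void *)notify);
    cudaFreeHost((void *)resume);
    return 0;
}
```

The design point this sketch captures is that the host learns about intra-kernel progress through memory flags rather than through kernel termination (e.g., `cudaDeviceSynchronize`), which is what allows CPU and GPU tasks to overlap instead of serializing at kernel boundaries.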