DOI: 10.1145/3332466.3374544

GPU initiated OpenSHMEM: correct and efficient intra-kernel networking for dGPUs

Published: 19 February 2020

Abstract

The current state of the art in GPU networking relies on a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent work has explored performing network operations from within a GPU kernel itself. However, these approaches typically involve the CPU in the critical path, which leads to high latency and inefficient utilization of network and GPU resources.
In this work, we introduce GPU Initiated OpenSHMEM (GIO), a new intra-kernel PGAS programming model and runtime that enables GPUs to communicate directly with a NIC without CPU intervention. We accomplish this by exploring the GPU's coarse-grained memory model and correcting the semantic mismatches that arise when GPUs interact directly with the network. GIO also reduces latency by relying on a novel template-based design to minimize the overhead of initiating a network operation. We show that for structured applications such as a 2D Jacobi stencil, GIO can improve application performance by up to 40% compared to traditional kernel-boundary networking. Furthermore, we demonstrate that on irregular applications such as Sparse Triangular Solve (SpTS), GIO provides up to a 44% improvement compared to existing intra-kernel networking schemes.

Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020
454 pages
ISBN:9781450368186
DOI:10.1145/3332466

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

  1. GPUs
  2. RDMA networks
  3. distributed programming models

Qualifiers

  • Research-article

Conference

PPoPP '20

Acceptance Rates

PPoPP '20 Paper Acceptance Rate: 28 of 121 submissions, 23%
Overall Acceptance Rate: 230 of 1,014 submissions, 23%

Cited By

  • (2024) Snoopie: A Multi-GPU Communication Profiler and Visualizer. Proceedings of the 38th ACM International Conference on Supercomputing, 525-536. DOI: 10.1145/3650200.3656597. Online: 30-May-2024.
  • (2024) Intel® SHMEM: GPU-initiated OpenSHMEM using SYCL. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 1288-1301. DOI: 10.1109/SCW63240.2024.00169. Online: 17-Nov-2024.
  • (2024) Autonomous Execution for Multi-GPU Systems: Compiler Support. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 1129-1140. DOI: 10.1109/SCW63240.2024.00155. Online: 17-Nov-2024.
  • (2024) Optimizing Distributed ML Communication with Fused Computation-Collective Operations. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-17. DOI: 10.1109/SC41406.2024.00094. Online: 17-Nov-2024.
  • (2023) Evaluating the Performance of One-sided Communication on CPUs and GPUs. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 1059-1069. DOI: 10.1145/3624062.3624182. Online: 12-Nov-2023.
  • (2023) Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge. Proceedings of the 37th International Conference on Supercomputing, 192-202. DOI: 10.1145/3577193.3593713. Online: 21-Jun-2023.
  • (2022) Lessons learned on MPI+threads communication. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-16. DOI: 10.5555/3571885.3571987. Online: 13-Nov-2022.
  • (2022) Lessons Learned on MPI+Threads Communication. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-16. DOI: 10.1109/SC41404.2022.00082. Online: Nov-2022.
  • (2022) Extending OpenMP and OpenSHMEM for Efficient Heterogeneous Computing. 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), 1-12. DOI: 10.1109/PAW-ATM56565.2022.00006. Online: Nov-2022.
  • (2021) A sage package for the symbolic-numeric factorization of linear differential operators. ACM Communications in Computer Algebra 55(2), 44-48. DOI: 10.1145/3493492.3493496. Online: 20-Oct-2021.
