DOI: 10.1145/2931088.2931091

GPUrdma: GPU-side library for high performance networking from GPU kernels

Published: 01 June 2016

Abstract

We present GPUrdma, a GPU-side library for performing Remote Direct Memory Access (RDMA) across the network directly from GPU kernels. The library executes no code on the CPU, accessing the InfiniBand Host Channel Adapter (HCA) hardware directly for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network-adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail.
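
To make the control flow concrete, the following is a minimal sketch, in CUDA, of what posting an RDMA write directly from a GPU kernel involves: build a work-queue entry (WQE) in an HCA-registered send queue, make it visible over PCIe, ring the HCA doorbell, and poll the completion queue (CQ). All struct layouts and names here are hypothetical simplifications for illustration; they are not the GPUrdma API or a real HCA WQE/CQE format.

    #include <cstdint>

    // Hypothetical, simplified queue-pair state mapped into GPU memory;
    // real WQE/CQE formats are considerably more involved.
    struct wqe_t { uint64_t laddr, raddr; uint32_t len, lkey, rkey, opcode; };
    struct cqe_t { uint32_t owner; };  // flipped by the HCA on completion

    struct gpu_qp_t {
        wqe_t             *sq;        // send queue, registered with the HCA
        volatile cqe_t    *cq;        // completion queue, written by the HCA
        volatile uint32_t *doorbell;  // HCA doorbell register, mapped over PCIe
        uint32_t           sq_head, cq_head, sq_size;
    };

    // Issue one RDMA write from a single GPU thread and wait for completion.
    __device__ void rdma_write(gpu_qp_t *qp, uint64_t laddr, uint64_t raddr,
                               uint32_t len, uint32_t lkey, uint32_t rkey)
    {
        wqe_t *wqe = &qp->sq[qp->sq_head % qp->sq_size];
        wqe->laddr  = laddr;
        wqe->raddr  = raddr;
        wqe->len    = len;
        wqe->lkey   = lkey;  // local memory registration key
        wqe->rkey   = rkey;  // remote memory registration key
        wqe->opcode = 0;     // placeholder for the RDMA-write opcode

        __threadfence_system();         // flush the WQE so the HCA can read it
        *qp->doorbell = ++qp->sq_head;  // ring the doorbell across PCIe

        while (qp->cq[qp->cq_head % qp->sq_size].owner == 0)
            ;                           // spin until the HCA posts the CQE
        ++qp->cq_head;
    }

As written, this path is driven by a single thread, which is exactly where slow per-thread GPU performance hurts; the design options analyzed in the paper revolve around spreading this work across the GPU's many threads and queues.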
We achieve 5 μsec one-way communication latency and up to 50 Gbit/sec transfer bandwidth for messages of 16 KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms CPU RDMA for smaller packets ranging from 2 to 1024 bytes by a factor of 4.5x, thanks to the greater parallelism of transfer requests enabled by the highly parallel GPU hardware.
We use GPUrdma to implement a subset of the Global Address Space Programming Interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications -- ping-pong and a multi-matrix-vector product with a constant matrix and multiple vectors -- each running on two different machines connected by InfiniBand. Our basic ping-pong implementation achieves 5% higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables a further 20% improvement. The multi-matrix-vector product is up to 4.5x faster, thanks to the higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs.
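
As an illustration only, a GPU-side ping-pong in this style might look like the sketch below. The gpi_write_gpu and gpi_wait_gpu device functions are invented names standing in for the paper's GPI subset (they are not the actual GPUrdma or GPI-2 interface); their stub bodies mark where the RDMA-write and completion-poll path sketched earlier would go.

    #include <cstddef>

    // Hypothetical device-side GPI-style calls; names and signatures are
    // illustrative, not the actual GPUrdma/GPI-2 API.
    __device__ void gpi_write_gpu(int rank, int segment, size_t offset, size_t len)
    {
        // would post an RDMA write of `len` bytes into `rank`'s segment
    }

    __device__ void gpi_wait_gpu(int segment, size_t offset)
    {
        // would poll until the peer's write into this segment arrives
    }

    // Ping-pong between two GPUs: each iteration sends a message to the peer
    // and blocks until the reply lands, all without returning to the CPU.
    __global__ void pingpong(int peer, int segment, size_t msg_len, int iters)
    {
        if (threadIdx.x != 0 || blockIdx.x != 0) return;  // one driver thread
        for (int i = 0; i < iters; ++i) {
            gpi_write_gpu(peer, segment, 0, msg_len);     // send the "ping"
            gpi_wait_gpu(segment, 0);                     // wait for the "pong"
        }
    }

The per-threadblock variant mentioned above would run one such exchange per thread block, so that one block's communication overlaps another block's computation.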
The GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs, which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, heterogeneous networking infrastructure.



Published In

ROSS '16: Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers
June 2016, 54 pages
ISBN: 9781450343879
DOI: 10.1145/2931088
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. GPGPUs
  2. Networking
  3. Operating Systems Design
  4. accelerators

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ROSS '16

Acceptance Rates

ROSS '16 paper acceptance rate: 6 of 10 submissions, 60%
Overall acceptance rate: 58 of 169 submissions, 34%


Cited By

  • (2024) Toward GPU-centric Networking on Commodity Hardware. Proceedings of the 7th International Workshop on Edge Systems, Analytics and Networking, 43-48. DOI: 10.1145/3642968.3654820. Online: 22-Apr-2024
  • (2024) Zero-sided RDMA: Network-driven Data Shuffling for Disaggregated Heterogeneous Cloud DBMSs. Proceedings of the ACM on Management of Data 2(1), 1-28. DOI: 10.1145/3639291. Online: 26-Mar-2024
  • (2024) DPU-Direct: Unleashing Remote Accelerators via Enhanced RDMA for Disaggregated Datacenters. IEEE Transactions on Computers 73(8), 2081-2095. DOI: 10.1109/TC.2024.3404089. Online: Aug-2024
  • (2024) Hardware Architecture. Edge Computing Acceleration, 125-166. DOI: 10.1002/9781119813873.ch5. Online: 29-Nov-2024
  • (2023) CPU-free Computing: A Vision with a Blueprint. Proceedings of the 19th Workshop on Hot Topics in Operating Systems, 1-14. DOI: 10.1145/3593856.3595906. Online: 22-Jun-2023
  • (2023) Skadi: Building a Distributed Runtime for Data Systems in Disaggregated Data Centers. Proceedings of the 19th Workshop on Hot Topics in Operating Systems, 94-102. DOI: 10.1145/3593856.3595897. Online: 22-Jun-2023
  • (2023) SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7(2), 1-26. DOI: 10.1145/3589974. Online: 22-May-2023
  • (2023) Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge. Proceedings of the 37th International Conference on Supercomputing, 192-202. DOI: 10.1145/3577193.3593713. Online: 21-Jun-2023
  • (2023) PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications. Proceedings of the 37th International Conference on Supercomputing, 167-179. DOI: 10.1145/3577193.3593705. Online: 21-Jun-2023
  • (2023) GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 325-339. DOI: 10.1145/3575693.3575748. Online: 27-Jan-2023
