DOI: 10.1145/2931088.2931091

GPUrdma: GPU-side library for high performance networking from GPU kernels

Published: 01 June 2016

Abstract

We present GPUrdma, a GPU-side library for performing Remote Direct Memory Access (RDMA) across the network directly from GPU kernels. The library executes no code on the CPU, accessing the InfiniBand Host Channel Adapter (HCA) hardware directly for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network-adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail.
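
To make the control flow concrete, the following is a minimal sketch, in CUDA, of what posting an RDMA write directly from a GPU kernel involves: build a work-queue entry (WQE) in an HCA-registered send queue, make it visible over PCIe, ring the HCA doorbell, and poll the completion queue (CQ). All struct layouts and names here are hypothetical simplifications for illustration; they are not the GPUrdma API or a real HCA WQE/CQE format.

    #include <cstdint>

    // Hypothetical, simplified queue-pair state mapped into GPU memory;
    // real WQE/CQE formats are considerably more involved.
    struct wqe_t { uint64_t laddr, raddr; uint32_t len, lkey, rkey, opcode; };
    struct cqe_t { uint32_t owner; };  // flipped by the HCA on completion

    struct gpu_qp_t {
        wqe_t             *sq;        // send queue, registered with the HCA
        volatile cqe_t    *cq;        // completion queue, written by the HCA
        volatile uint32_t *doorbell;  // HCA doorbell register, mapped over PCIe
        uint32_t           sq_head, cq_head, sq_size;
    };

    // Issue one RDMA write from a single GPU thread and wait for completion.
    __device__ void rdma_write(gpu_qp_t *qp, uint64_t laddr, uint64_t raddr,
                               uint32_t len, uint32_t lkey, uint32_t rkey)
    {
        wqe_t *wqe = &qp->sq[qp->sq_head % qp->sq_size];
        wqe->laddr  = laddr;
        wqe->raddr  = raddr;
        wqe->len    = len;
        wqe->lkey   = lkey;  // local memory registration key
        wqe->rkey   = rkey;  // remote memory registration key
        wqe->opcode = 0;     // placeholder for the RDMA-write opcode

        __threadfence_system();         // flush the WQE so the HCA can read it
        *qp->doorbell = ++qp->sq_head;  // ring the doorbell across PCIe

        while (qp->cq[qp->cq_head % qp->sq_size].owner == 0)
            ;                           // spin until the HCA posts the CQE
        ++qp->cq_head;
    }

As written, this path is driven by a single thread, which is exactly where slow per-thread GPU performance hurts; the design options analyzed in the paper revolve around spreading this work across the GPU's many threads and queues.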
We achieve 5 μsec one-way communication latency and up to 50 Gbit/sec transfer bandwidth for messages of 16 KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms CPU RDMA for smaller packets ranging from 2 to 1024 bytes by a factor of 4.5x, thanks to the greater parallelism of transfer requests enabled by the highly parallel GPU hardware.
We use GPUrdma to implement a subset of the Global Address Space Programming Interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications -- ping-pong and a multi-matrix-vector product with a constant matrix and multiple vectors -- each running on two different machines connected by InfiniBand. Our basic ping-pong implementation achieves 5% higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables a further 20% improvement. The multi-matrix-vector product is up to 4.5x faster, thanks to the higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs.
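
As an illustration only, a GPU-side ping-pong in this style might look like the sketch below. The gpi_write_gpu and gpi_wait_gpu device functions are invented names standing in for the paper's GPI subset (they are not the actual GPUrdma or GPI-2 interface); their stub bodies mark where the RDMA-write and completion-poll path sketched earlier would go.

    #include <cstddef>

    // Hypothetical device-side GPI-style calls; names and signatures are
    // illustrative, not the actual GPUrdma/GPI-2 API.
    __device__ void gpi_write_gpu(int rank, int segment, size_t offset, size_t len)
    {
        // would post an RDMA write of `len` bytes into `rank`'s segment
    }

    __device__ void gpi_wait_gpu(int segment, size_t offset)
    {
        // would poll until the peer's write into this segment arrives
    }

    // Ping-pong between two GPUs: each iteration sends a message to the peer
    // and blocks until the reply lands, all without returning to the CPU.
    __global__ void pingpong(int peer, int segment, size_t msg_len, int iters)
    {
        if (threadIdx.x != 0 || blockIdx.x != 0) return;  // one driver thread
        for (int i = 0; i < iters; ++i) {
            gpi_write_gpu(peer, segment, 0, msg_len);     // send the "ping"
            gpi_wait_gpu(segment, 0);                     // wait for the "pong"
        }
    }

The per-threadblock variant mentioned above would run one such exchange per thread block, so that one block's communication overlaps another block's computation.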
The GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs, which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, heterogeneous networking infrastructure.



Published In

ROSS '16: Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers
June 2016, 54 pages
ISBN: 9781450343879
DOI: 10.1145/2931088
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. GPGPUs
  2. Networking
  3. Operating Systems Design
  4. accelerators

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ROSS '16

Acceptance Rates

ROSS '16 paper acceptance rate: 6 of 10 submissions, 60%
Overall acceptance rate: 58 of 169 submissions, 34%


Cited By

  • (2024) Toward GPU-centric Networking on Commodity Hardware. Proceedings of the 7th International Workshop on Edge Systems, Analytics and Networking, 43-48. DOI: 10.1145/3642968.3654820. Online: 22-Apr-2024
  • (2024) Zero-sided RDMA: Network-driven Data Shuffling for Disaggregated Heterogeneous Cloud DBMSs. Proceedings of the ACM on Management of Data 2(1), 1-28. DOI: 10.1145/3639291. Online: 26-Mar-2024
  • (2024) DPU-Direct: Unleashing Remote Accelerators via Enhanced RDMA for Disaggregated Datacenters. IEEE Transactions on Computers 73(8), 2081-2095. DOI: 10.1109/TC.2024.3404089. Online: Aug-2024
  • (2024) Hardware Architecture. Edge Computing Acceleration, 125-166. DOI: 10.1002/9781119813873.ch5. Online: 29-Nov-2024
  • (2023) CPU-free Computing: A Vision with a Blueprint. Proceedings of the 19th Workshop on Hot Topics in Operating Systems, 1-14. DOI: 10.1145/3593856.3595906. Online: 22-Jun-2023
  • (2023) Skadi: Building a Distributed Runtime for Data Systems in Disaggregated Data Centers. Proceedings of the 19th Workshop on Hot Topics in Operating Systems, 94-102. DOI: 10.1145/3593856.3595897. Online: 22-Jun-2023
  • (2023) SplitRPC: A {Control + Data} Path Splitting RPC Stack for ML Inference Serving. Proceedings of the ACM on Measurement and Analysis of Computing Systems 7(2), 1-26. DOI: 10.1145/3589974. Online: 22-May-2023
  • (2023) Multi-GPU Communication Schemes for Iterative Solvers: When CPUs are Not in Charge. Proceedings of the 37th International Conference on Supercomputing, 192-202. DOI: 10.1145/3577193.3593713. Online: 21-Jun-2023
  • (2023) PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications. Proceedings of the 37th International Conference on Supercomputing, 167-179. DOI: 10.1145/3577193.3593705. Online: 21-Jun-2023
  • (2023) GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 325-339. DOI: 10.1145/3575693.3575748. Online: 27-Jan-2023
