ABSTRACT
GPUs are widespread across clusters of compute nodes because of their attractive performance on data-parallel codes. However, communicating between GPUs across a cluster remains cumbersome compared to established CPU networking stacks. A number of recent works have enabled GPUs to access the network more naturally, but these approaches suffer from performance problems, require hidden CPU helper threads, or restrict communication to kernel boundaries.
In this paper, we propose GPU Triggered Networking, a novel, GPU-centric networking approach that leverages the strengths of both CPUs and GPUs. In this model, the CPU creates and stages network messages, and the GPU triggers the network interface when data is ready to send. GPU Triggered Networking decouples these two operations, thereby removing the CPU from the critical path. We show that this approach provides up to 25% speedup over standard GPU networking across microbenchmarks, a Jacobi stencil, an important MPI collective operation, and machine-learning workloads.
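The staging/triggering split described above resembles Portals-style triggered operations: the CPU registers a message descriptor ahead of time, and the GPU later fires a single lightweight trigger so the NIC can send without the CPU on the critical path. The sketch below is a host-side simulation of that control flow only; all names (`TriggeredNIC`, `stage`, `trigger`) are illustrative, and a real implementation would arm the NIC and ring its doorbell from GPU code rather than from Python threads.

```python
import threading
import queue

class TriggeredNIC:
    """Toy NIC model: sends a pre-staged message once it is triggered."""

    def __init__(self):
        self.staged = {}           # handle -> staged message payload
        self.sent = queue.Queue()  # stands in for the wire
        self._lock = threading.Lock()

    def stage(self, handle, payload):
        # CPU path: build and register the message descriptor ahead of time,
        # outside the critical path.
        with self._lock:
            self.staged[handle] = payload

    def trigger(self, handle):
        # "GPU" path: one cheap write fires the pre-staged send, so the CPU
        # is not involved when the data becomes ready.
        with self._lock:
            payload = self.staged.pop(handle)
        self.sent.put(payload)

nic = TriggeredNIC()
nic.stage(handle=7, payload=b"halo-exchange row 0")  # CPU stages early

def gpu_kernel():
    # ... produce data on the device ...
    nic.trigger(handle=7)                            # GPU fires when ready

t = threading.Thread(target=gpu_kernel)
t.start()
t.join()
received = nic.sent.get()
print(received)
```

The decoupling is the point: `stage` and `trigger` can happen arbitrarily far apart in time, which is what lets the CPU do its part ahead of the kernel and stay off the latency-critical send path.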