ABSTRACT
GPUs are widespread across clusters of compute nodes because of their attractive performance on data-parallel codes. However, communicating between GPUs across a cluster remains cumbersome compared to established CPU networking stacks. A number of recent works have enabled GPUs to access the network more naturally, but these approaches suffer from performance problems, require hidden CPU helper threads, or restrict communication to kernel boundaries.
In this paper, we propose GPU Triggered Networking, a novel, GPU-centric networking approach that leverages the strengths of both CPUs and GPUs. In this model, the CPU creates and stages network messages, and the GPU triggers the network interface when data is ready to send. GPU Triggered Networking decouples these two operations, thereby removing the CPU from the critical path. We show that this approach provides up to 25% speedup over standard GPU networking across microbenchmarks, a Jacobi stencil, an important MPI collective operation, and machine-learning workloads.
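The staging/triggering split described above resembles Portals-style triggered operations: the CPU registers a message descriptor ahead of time, and the GPU later fires a single lightweight trigger so the NIC can send without the CPU on the critical path. The sketch below is a host-side simulation of that control flow only; all names (`TriggeredNIC`, `stage`, `trigger`) are illustrative, and a real implementation would arm the NIC and ring its doorbell from GPU code rather than from Python threads.

```python
import threading
import queue

class TriggeredNIC:
    """Toy NIC model: sends a pre-staged message once it is triggered."""

    def __init__(self):
        self.staged = {}           # handle -> staged message payload
        self.sent = queue.Queue()  # stands in for the wire
        self._lock = threading.Lock()

    def stage(self, handle, payload):
        # CPU path: build and register the message descriptor ahead of time,
        # outside the critical path.
        with self._lock:
            self.staged[handle] = payload

    def trigger(self, handle):
        # "GPU" path: one cheap write fires the pre-staged send, so the CPU
        # is not involved when the data becomes ready.
        with self._lock:
            payload = self.staged.pop(handle)
        self.sent.put(payload)

nic = TriggeredNIC()
nic.stage(handle=7, payload=b"halo-exchange row 0")  # CPU stages early

def gpu_kernel():
    # ... produce data on the device ...
    nic.trigger(handle=7)                            # GPU fires when ready

t = threading.Thread(target=gpu_kernel)
t.start()
t.join()
received = nic.sent.get()
print(received)
```

The decoupling is the point: `stage` and `trigger` can happen arbitrarily far apart in time, which is what lets the CPU do its part ahead of the kernel and stay off the latency-critical send path.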