Research Article · Public Access
DOI: 10.1145/3126908.3126950

GPU triggered networking for intra-kernel communications

Published: 12 November 2017

ABSTRACT

GPUs are widespread across clusters of compute nodes due to their attractive performance for data parallel codes. However, communicating between GPUs across the cluster is cumbersome when compared to CPU networking implementations. A number of recent works have enabled GPUs to more naturally access the network, but suffer from performance problems, require hidden CPU helper threads, or restrict communications to kernel boundaries.

In this paper, we propose GPU Triggered Networking, a novel, GPU-centric networking approach which leverages the best of CPUs and GPUs. In this model, CPUs create and stage network messages and GPUs trigger the network interface when data is ready to send. GPU Triggered Networking decouples these two operations, thereby removing the CPU from the critical path. We illustrate how this approach can provide up to 25% speedup compared to standard GPU networking across microbenchmarks, a Jacobi stencil, an important MPI collective operation, and machine-learning workloads.
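To make the model concrete, below is a minimal CUDA sketch of the idea: the CPU pre-stages a message and the GPU fires it from inside a running kernel by writing a counter. The triggered-put call (tn_stage_put) is a hypothetical placeholder loosely modeled on Portals-4-style triggered operations, and the NIC counter is emulated with a mapped host flag; none of these names come from the paper itself.

```cuda
// Sketch of the GPU-triggered networking idea described in the abstract.
// The "NIC triggered put" is emulated: the flag below stands in for a NIC
// doorbell/counter, and tn_stage_put is a hypothetical placeholder, not
// the paper's actual interface.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// Hypothetical host-side call: pre-stage a put that the NIC would fire once
// the counter reaches 'threshold'. Real code would write a descriptor into
// NIC command-queue memory; here it only records the intent.
static void tn_stage_put(const void* src, size_t bytes, int peer,
                         uint64_t threshold) {
    (void)src;
    printf("staged %zu-byte put to peer %d, fires at counter >= %llu\n",
           bytes, peer, (unsigned long long)threshold);
}

// GPU kernel: do the work, then trigger the pre-staged put by writing the
// counter -- the CPU is not on the critical path. A real implementation must
// ensure all producing threads have finished before triggering (e.g. via a
// grid-wide barrier); a single-block launch keeps this sketch simple.
__global__ void compute_and_trigger(float* buf, int n, volatile uint64_t* ctr) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        buf[i] *= 2.0f;                      // stand-in for real work
    __syncthreads();                         // all data produced
    if (threadIdx.x == 0) {
        __threadfence_system();              // make data visible beyond the GPU
        *ctr = 1;                            // "ring the doorbell"
    }
}

int main() {
    const int n = 1 << 20;
    float* buf = nullptr;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    // Counter in pinned, GPU-visible host memory (stands in for NIC space).
    uint64_t* ctr = nullptr;
    cudaHostAlloc((void**)&ctr, sizeof(*ctr), cudaHostAllocMapped);
    *ctr = 0;
    uint64_t* d_ctr = nullptr;
    cudaHostGetDevicePointer((void**)&d_ctr, ctr, 0);

    // CPU stages the message up front, off the critical path...
    tn_stage_put(buf, n * sizeof(float), /*peer=*/1, /*threshold=*/1);

    // ...and the GPU triggers it from inside the kernel when data is ready.
    compute_and_trigger<<<1, 256>>>(buf, n, d_ctr);
    cudaDeviceSynchronize();

    printf("counter = %llu (the NIC would now issue the put)\n",
           (unsigned long long)*ctr);
    cudaFree(buf);
    cudaFreeHost(ctr);
    return 0;
}
```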


Published in:
SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017, 801 pages
ISBN: 9781450351140
DOI: 10.1145/3126908
General Chair: Bernd Mohr
Program Chair: Padma Raghavan
        Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



        Acceptance Rates

SC '17 Paper Acceptance Rate: 61 of 327 submissions, 19%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
