Research article
DOI: 10.1145/3332466.3374544

GPU Initiated OpenSHMEM: correct and efficient intra-kernel networking for dGPUs

Published: 19 February 2020

ABSTRACT

The current state of the art in GPU networking uses a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored performing network operations from within a GPU kernel itself. However, these approaches typically involve the CPU in the critical path, which leads to high latency and inefficient use of network and/or GPU resources.
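
For context, the host-centric, kernel-boundary model the abstract refers to typically looks like the sketch below: the kernel must run to completion and the CPU must synchronize before any communication is issued, even when the data already resides in GPU memory. This is an illustrative example only, not code from the paper; it assumes a GPU-aware MPI that accepts device pointers, and names such as jacobi_step and host_driven_step are invented for the sketch.

```cpp
#include <hip/hip_runtime.h>
#include <mpi.h>

// One Jacobi relaxation step over the interior of an nx-by-ny grid.
__global__ void jacobi_step(double* out, const double* in, int nx, int ny) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (i < nx - 1 && j < ny - 1)
        out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                  in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
}

// Host-driven halo exchange (one direction shown): the kernel must finish and
// the CPU must synchronize before the boundary row can be sent, even though
// a GPU-aware MPI can read the device buffer directly.
void host_driven_step(double* d_new, const double* d_old, int nx, int ny,
                      int up, int down, MPI_Comm comm) {
    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    hipLaunchKernelGGL(jacobi_step, grid, block, 0, 0, d_new, d_old, nx, ny);
    hipDeviceSynchronize();  // CPU blocks until the whole kernel completes
    MPI_Sendrecv(d_new + nx,            nx, MPI_DOUBLE, up,   0,  // send first interior row
                 d_new + (ny - 1) * nx, nx, MPI_DOUBLE, down, 0,  // receive bottom ghost row
                 comm, MPI_STATUS_IGNORE);
}
```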

In this work, we introduce GPU Initiated OpenSHMEM (GIO), a new intra-kernel PGAS programming model and runtime that enables GPUs to communicate directly with a NIC without the intervention of the CPU. We accomplish this by exploring the GPU's coarse-grained memory model and correcting semantic mismatches when GPUs wish to directly interact with the network. GIO also reduces latency by relying on a novel template-based design to minimize the overhead of initiating a network operation. We illustrate that for structured applications like a Jacobi 2D stencil, GIO can improve application performance by up to 40% compared to traditional kernel-boundary networking. Furthermore, we demonstrate that on irregular applications like Sparse Triangular Solve (SpTS), GIO provides up to 44% improvement compared to existing intra-kernel networking schemes.
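
In contrast, GIO lets the kernel itself issue OpenSHMEM-style operations. Since the paper's device-side API is not reproduced on this page, the sketch below uses hypothetical stand-ins (gio_putmem_nbi, gio_quiet) for a non-blocking put and its completion, simply to illustrate the shape of the intra-kernel networking the abstract describes; the actual GIO interface and its memory-ordering rules may differ.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical device-callable stand-ins for GIO/OpenSHMEM-style primitives.
// The real GIO runtime provides its own API; these are assumptions for the sketch.
__device__ void gio_putmem_nbi(void* dest, const void* src, size_t nbytes, int pe);
__device__ void gio_quiet();

__global__ void jacobi_step_intra_kernel(double* out, const double* in,
                                         int nx, int ny, int up_pe) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    if (i < nx - 1 && j < ny - 1)
        out[j * nx + i] = 0.25 * (in[j * nx + i - 1] + in[j * nx + i + 1] +
                                  in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);

    __syncthreads();  // make this work-group's boundary values visible to its leader

    // Work-groups that computed the first interior row push their slice to the
    // neighbouring PE directly from the kernel; no CPU round-trip is needed.
    if (blockIdx.y == 0 && threadIdx.x == 0 && threadIdx.y == 0) {
        int first = blockIdx.x * blockDim.x + 1;
        int count = min((int)blockDim.x, nx - 1 - first);
        if (count > 0) {
            // Symmetric-heap addressing: the same offset is assumed valid on the remote PE.
            gio_putmem_nbi(out + nx + first, out + nx + first,
                           count * sizeof(double), up_pe);
            gio_quiet();  // wait for the put to complete before the buffer is reused
        }
    }
}
```

The abstract's template-based design suggests that most of each network descriptor can be prepared ahead of time, so a device-side call like the hypothetical gio_putmem_nbi above would only need to fill in per-message fields before handing the operation to the NIC; the exact mechanism is described in the paper itself.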


Published in

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020, 454 pages
ISBN: 9781450368186
DOI: 10.1145/3332466
Copyright © 2020 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States



Acceptance Rates

PPoPP '20 paper acceptance rate: 28 of 121 submissions, 23%. Overall acceptance rate: 230 of 1,014 submissions, 23%.
