
Gravel: fine-grain GPU-initiated network messages

Published: 12 November 2017
DOI: 10.1145/3126908.3126914

ABSTRACT

Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces programmers to manually route messages through the host CPU. Other models allow GPU-initiated communication, but are inefficient for small messages.
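
As a concrete illustration of that round trip, the following sketch (hypothetical code, not taken from the paper; the kernel, sizes, and ranks are made up) shows the coprocessor pattern: the GPU finishes a kernel, the result is copied back to the host, and only then can the CPU hand the data to the network via MPI.

    // Hypothetical sketch of the traditional coprocessor model: the GPU cannot
    // send on its own, so every message takes a kernel-boundary round trip
    // through the host CPU (compute -> copy back -> MPI send from the CPU).
    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void compute(int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = i * i;               // stand-in for real per-thread results
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1024;
        int* d_out;
        int h_out[1024];
        cudaMalloc(&d_out, n * sizeof(int));

        compute<<<(n + 255) / 256, 256>>>(d_out, n);
        cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost); // GPU -> CPU

        // Only now can the data leave the node, and only the CPU can send it.
        if (rank == 0)
            MPI_Send(h_out, n, MPI_INT, 1 /*dest*/, 0 /*tag*/, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(h_out, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(d_out);
        MPI_Finalize();
        return 0;
    }

For many small, irregular messages this pattern forces either frequent kernel launches or manual batching on the host, which is the programmability and performance gap Gravel targets.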

To enable fine-grain PGAS-style communication between threads executing on different GPUs, we introduce Gravel. GPU-initiated messages are offloaded through a GPU-efficient concurrent queue to an aggregator (implemented with CPU threads), which combines messages targeting the same destination. Gravel leverages diverged work-group-level semantics to amortize synchronization across the GPU's data-parallel lanes.
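
To make these two ideas concrete, here is a minimal sketch (our own illustration with hypothetical names such as MsgQueue and produce_messages; it is not Gravel's implementation, and it drains the queue after kernel completion rather than concurrently as Gravel's CPU aggregator does): each work-group claims queue slots for all of its lanes with a single global atomic, and a host-side aggregator bins the queued messages by destination before sending one combined buffer per node.

    // Hedged sketch, not Gravel itself: (1) GPU threads offload small messages
    // into a shared queue, amortizing the global atomic across a work-group
    // (one atomicAdd per block rather than one per thread); (2) a CPU
    // "aggregator" drains the queue and coalesces messages that target the
    // same destination node. A real system would drain concurrently and then
    // hand each aggregated buffer to the network layer.
    #include <algorithm>
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    struct Msg { int dst_node; int payload; };   // a small PGAS-style message

    struct MsgQueue {                            // bounded message queue in managed memory
        Msg* slots;
        unsigned* tail;                          // next free slot (global counter)
        unsigned capacity;
    };

    // Each thread sends one small message. The block reserves slots for all of
    // its threads with a single atomicAdd (work-group-level amortization).
    __global__ void produce_messages(MsgQueue q, int num_nodes) {
        __shared__ unsigned block_base;
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        if (threadIdx.x == 0)
            block_base = atomicAdd(q.tail, blockDim.x);  // one global atomic per block
        __syncthreads();

        unsigned slot = block_base + threadIdx.x;
        if (slot < q.capacity) {
            q.slots[slot].dst_node = tid % num_nodes;    // arbitrary destination choice
            q.slots[slot].payload  = tid;
        }
    }

    int main() {
        const int num_nodes = 4, threads = 256, blocks = 8;
        const unsigned capacity = threads * blocks;

        MsgQueue q;
        cudaMallocManaged(&q.slots, capacity * sizeof(Msg));
        cudaMallocManaged(&q.tail, sizeof(unsigned));
        *q.tail = 0;
        q.capacity = capacity;

        produce_messages<<<blocks, threads>>>(q, num_nodes);
        cudaDeviceSynchronize();

        // CPU-side aggregation: bin messages by destination so each node gets one
        // combined buffer instead of many tiny network messages.
        std::vector<std::vector<Msg>> bins(num_nodes);
        unsigned produced = std::min(*q.tail, capacity);
        for (unsigned i = 0; i < produced; ++i)
            bins[q.slots[i].dst_node].push_back(q.slots[i]);

        for (int n = 0; n < num_nodes; ++n)
            printf("node %d: would send 1 aggregated buffer of %zu messages\n",
                   n, bins[n].size());

        cudaFree(q.slots);
        cudaFree(q.tail);
        return 0;
    }

The single per-block atomicAdd is the point of the work-group-level offload in this sketch: with 256 lanes per block it replaces 256 contended global atomics with one.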

Using Gravel, we can distribute six applications, each with frequent small messages, across a cluster of eight GPU-accelerated nodes. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show Gravel is more programmable and usually performs better than prior GPU networking models.


    Published in

      SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
      November 2017, 801 pages
      ISBN: 9781450351140
      DOI: 10.1145/3126908
      General Chair: Bernd Mohr
      Program Chair: Padma Raghavan

      Copyright © 2017 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      SC '17 Paper Acceptance Rate: 61 of 327 submissions, 19%. Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%.
