ABSTRACT
Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces programmers to manually route messages through the host CPU. Other models allow GPU-initiated communication, but are inefficient for small messages.
To enable fine-grain PGAS-style communication between threads executing on different GPUs, we introduce Gravel. GPU-initiated messages are offloaded through a GPU-efficient concurrent queue to an aggregator (implemented with CPU threads), which combines messages targeted to the same destination. Gravel leverages diverged work-group-level semantics to amortize synchronization across the GPU's data-parallel lanes.
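To make the offload path concrete, the following is a minimal CUDA sketch of the idea, not Gravel's actual implementation: GPU producers publish small messages into a bounded concurrent queue in unified memory, reserving queue slots with one atomic per warp (a warp-level analogue of Gravel's work-group-level offload), while host code drains the queue and buckets messages by destination. All names (`Message`, `MsgQueue`, `gpu_send`) are hypothetical, and the ticketed-slot queue is a stand-in design rather than the paper's.

```cuda
// Minimal sketch, not Gravel's actual code: GPU threads publish small
// messages into a bounded multi-producer queue in unified memory; the CPU
// side drains it and buckets messages by destination node. Names and the
// ticketed-slot scheme are illustrative. Requires CUDA 11+ (libcu++
// system-scope atomics).
#include <cstdio>
#include <new>
#include <vector>
#include <cuda/atomic>

struct Message { int dest; int payload; };

constexpr unsigned QCAP = 1024;  // queue capacity (power of two)
using SysAtomic = cuda::atomic<unsigned, cuda::thread_scope_system>;

struct Slot {
    Message msg;
    SysAtomic seq;   // slot is writable for ticket t when seq == t
};

struct MsgQueue {
    Slot slots[QCAP];
    SysAtomic tail;  // global producer ticket counter
};

// Device-side enqueue: claim the slot for `ticket`, write, then publish.
__device__ void gpu_send(MsgQueue* q, unsigned ticket, Message m) {
    Slot& s = q->slots[ticket % QCAP];
    while (s.seq.load(cuda::std::memory_order_acquire) != ticket) {}  // slot free?
    s.msg = m;
    s.seq.store(ticket + 1, cuda::std::memory_order_release);         // publish
}

__global__ void produce(MsgQueue* q, int nNodes) {
    // One atomic per warp: the leader lane reserves tickets for every active
    // lane, amortizing synchronization across the data-parallel lanes (the
    // warp-level analogue of amortizing across a work-group).
    unsigned mask = __activemask();
    int lane = threadIdx.x % 32;
    int leader = __ffs(mask) - 1;
    unsigned base = 0;
    if (lane == leader)
        base = q->tail.fetch_add(__popc(mask), cuda::std::memory_order_relaxed);
    base = __shfl_sync(mask, base, leader);
    unsigned ticket = base + __popc(mask & ((1u << lane) - 1));

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    gpu_send(q, ticket, Message{gid % nNodes, gid});  // small per-thread message
}

int main() {
    constexpr int nNodes = 8;
    constexpr unsigned nMsgs = QCAP;  // <= QCAP so this demo can drain after sync
    MsgQueue* q;
    cudaMallocManaged(&q, sizeof(MsgQueue));
    new (q) MsgQueue;  // placement-construct the atomics in managed memory
    for (unsigned i = 0; i < QCAP; ++i) q->slots[i].seq.store(i);
    q->tail.store(0);

    produce<<<nMsgs / 256, 256>>>(q, nNodes);
    cudaDeviceSynchronize();

    // Host-side aggregation: bucket messages by destination. A real
    // aggregator thread would run concurrently with the kernel and flush
    // each per-destination batch over the network as one large transfer.
    std::vector<int> perDest(nNodes, 0);
    for (unsigned head = 0; head < nMsgs; ++head) {
        Slot& s = q->slots[head % QCAP];
        while (s.seq.load(cuda::std::memory_order_acquire) != head + 1) {}
        perDest[s.msg.dest]++;
        s.seq.store(head + QCAP, cuda::std::memory_order_release);  // recycle slot
    }
    for (int d = 0; d < nNodes; ++d)
        printf("dest %d: %d messages\n", d, perDest[d]);
    cudaFree(q);
    return 0;
}
```

The ticketed slots let many GPU warps publish out of order while the consumer drains in ticket order; batching messages per destination before sending is what lets many fine-grain messages travel as a few large transfers.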
Using Gravel, we distribute six applications, each with frequent small messages, across a cluster of eight GPU-accelerated nodes. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show that Gravel is more programmable and usually performs better than prior GPU networking models.
Index Terms
- Gravel: fine-grain GPU-initiated network messages