ABSTRACT
Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces programmers to manually route messages through the host CPU. Other models allow GPU-initiated communication, but are inefficient for small messages.
To enable fine-grain PGAS-style communication between threads executing on different GPUs, we introduce Gravel. GPU-initiated messages are offloaded through a GPU-efficient concurrent queue to an aggregator (implemented with CPU threads), which combines messages targeted to the same destination. Gravel leverages diverged work-group-level semantics to amortize synchronization across the GPU's data-parallel lanes.
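To make the offload path concrete, the following is a minimal CUDA sketch of the idea, not Gravel's actual implementation: GPU producers publish small messages into a bounded concurrent queue in unified memory, reserving queue slots with one atomic per warp (a warp-level analogue of Gravel's work-group-level offload), while host code drains the queue and buckets messages by destination. All names (`Message`, `MsgQueue`, `gpu_send`) are hypothetical, and the ticketed-slot queue is a stand-in design rather than the paper's.

```cuda
// Minimal sketch, not Gravel's actual code: GPU threads publish small
// messages into a bounded multi-producer queue in unified memory; the CPU
// side drains it and buckets messages by destination node. Names and the
// ticketed-slot scheme are illustrative. Requires CUDA 11+ (libcu++
// system-scope atomics).
#include <cstdio>
#include <new>
#include <vector>
#include <cuda/atomic>

struct Message { int dest; int payload; };

constexpr unsigned QCAP = 1024;  // queue capacity (power of two)
using SysAtomic = cuda::atomic<unsigned, cuda::thread_scope_system>;

struct Slot {
    Message msg;
    SysAtomic seq;   // slot is writable for ticket t when seq == t
};

struct MsgQueue {
    Slot slots[QCAP];
    SysAtomic tail;  // global producer ticket counter
};

// Device-side enqueue: claim the slot for `ticket`, write, then publish.
__device__ void gpu_send(MsgQueue* q, unsigned ticket, Message m) {
    Slot& s = q->slots[ticket % QCAP];
    while (s.seq.load(cuda::std::memory_order_acquire) != ticket) {}  // slot free?
    s.msg = m;
    s.seq.store(ticket + 1, cuda::std::memory_order_release);         // publish
}

__global__ void produce(MsgQueue* q, int nNodes) {
    // One atomic per warp: the leader lane reserves tickets for every active
    // lane, amortizing synchronization across the data-parallel lanes (the
    // warp-level analogue of amortizing across a work-group).
    unsigned mask = __activemask();
    int lane = threadIdx.x % 32;
    int leader = __ffs(mask) - 1;
    unsigned base = 0;
    if (lane == leader)
        base = q->tail.fetch_add(__popc(mask), cuda::std::memory_order_relaxed);
    base = __shfl_sync(mask, base, leader);
    unsigned ticket = base + __popc(mask & ((1u << lane) - 1));

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    gpu_send(q, ticket, Message{gid % nNodes, gid});  // small per-thread message
}

int main() {
    constexpr int nNodes = 8;
    constexpr unsigned nMsgs = QCAP;  // <= QCAP so this demo can drain after sync
    MsgQueue* q;
    cudaMallocManaged(&q, sizeof(MsgQueue));
    new (q) MsgQueue;  // placement-construct the atomics in managed memory
    for (unsigned i = 0; i < QCAP; ++i) q->slots[i].seq.store(i);
    q->tail.store(0);

    produce<<<nMsgs / 256, 256>>>(q, nNodes);
    cudaDeviceSynchronize();

    // Host-side aggregation: bucket messages by destination. A real
    // aggregator thread would run concurrently with the kernel and flush
    // each per-destination batch over the network as one large transfer.
    std::vector<int> perDest(nNodes, 0);
    for (unsigned head = 0; head < nMsgs; ++head) {
        Slot& s = q->slots[head % QCAP];
        while (s.seq.load(cuda::std::memory_order_acquire) != head + 1) {}
        perDest[s.msg.dest]++;
        s.seq.store(head + QCAP, cuda::std::memory_order_release);  // recycle slot
    }
    for (int d = 0; d < nNodes; ++d)
        printf("dest %d: %d messages\n", d, perDest[d]);
    cudaFree(q);
    return 0;
}
```

The ticketed slots let many GPU warps publish out of order while the consumer drains in ticket order; batching messages per destination before sending is what lets many fine-grain messages travel as a few large transfers.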
Using Gravel, we distribute six applications, each with frequent small messages, across a cluster of eight GPU-accelerated nodes. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show that Gravel is more programmable and usually performs better than prior GPU networking models.
Index Terms
- Gravel: fine-grain GPU-initiated network messages