DOI: 10.1145/2642769.2642773
Research article

GPU-Aware Intranode MPI_Allreduce

Published: 09 September 2014

Abstract

Modern multi-core clusters are increasingly using GPUs to achieve higher performance and power efficiency. In such clusters, efficient communication among processes whose data reside in GPU memory is of paramount importance to the performance of MPI applications. This paper investigates the efficient design of the intranode MPI_Allreduce operation in GPU clusters. We propose two design alternatives that exploit in-GPU reduction and the fast intranode communication capabilities of modern GPUs. Our GPU shared-buffer aware design and our GPU-aware binomial reduce-broadcast algorithmic approach deliver speedups over MVAPICH2 of up to 22x and 16x, respectively.
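
To make the binomial reduce-broadcast pattern concrete, the sketch below shows the same algorithmic structure on host buffers using plain MPI point-to-point calls. It is a minimal illustration, not the paper's implementation: the function name binomial_allreduce_sum is ours, and in the paper's GPU-aware designs the element-wise addition would instead run as an in-GPU reduction kernel, with data movement going through CUDA IPC or shared staging buffers between processes on the same node rather than host-side sends and receives.

    /* Hedged sketch (not the paper's code): a binomial-tree reduction toward
       rank 0 followed by a broadcast, which together form an allreduce. */
    #include <mpi.h>
    #include <stdlib.h>

    static void binomial_allreduce_sum(float *buf, int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        float *tmp = (float *)malloc((size_t)count * sizeof(float));

        /* Reduce phase: combine partial results up a binomial tree toward rank 0. */
        for (int mask = 1; mask < size; mask <<= 1) {
            if (rank & mask) {
                /* The lower-numbered partner owns the combined result from here on. */
                MPI_Send(buf, count, MPI_FLOAT, rank - mask, 0, comm);
                break;
            } else if (rank + mask < size) {
                MPI_Recv(tmp, count, MPI_FLOAT, rank + mask, 0, comm, MPI_STATUS_IGNORE);
                for (int i = 0; i < count; i++)
                    buf[i] += tmp[i];   /* an in-GPU reduction kernel in the GPU-aware variant */
            }
        }

        /* Broadcast phase: distribute the fully reduced result from rank 0. */
        MPI_Bcast(buf, count, MPI_FLOAT, 0, comm);

        free(tmp);
    }

The sketch captures only the communication structure; invoking an allreduce directly on device pointers, as the paper's designs do, additionally requires a GPU-aware MPI library such as MVAPICH2.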




Information

Published In

EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting
September 2014
183 pages
ISBN: 9781450328753
DOI: 10.1145/2642769
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • Kyoto University
  • University of Tokyo
  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 September 2014

Author Tags

  1. GPU
  2. IPC
  3. Intranode MPI_Allreduce
  4. MPI
  5. Shared Buffer

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI/ASIA '14

Acceptance Rates

EuroMPI/ASIA '14 Paper Acceptance Rate 18 of 39 submissions, 46%;
Overall Acceptance Rate 18 of 39 submissions, 46%


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 3
Reflects downloads up to 17 Jan 2025

Cited By

  • (2022) Accelerating Deep Learning Using Interconnect-Aware UCX Communication for MPI Collectives. IEEE Micro 42(2):68-76. DOI: 10.1109/MM.2022.3148670. Online publication date: 1 March 2022.
  • (2021) Efficient Multi-Path NVLink/PCIe-Aware UCX based Collective Communication for Deep Learning. 2021 IEEE Symposium on High-Performance Interconnects (HOTI), pages 25-34. DOI: 10.1109/HOTI52880.2021.00018. Online publication date: August 2021.
  • (2019) Node-Aware Improvements to Allreduce. 2019 IEEE/ACM Workshop on Exascale MPI (ExaMPI), pages 19-28. DOI: 10.1109/ExaMPI49596.2019.00008. Online publication date: November 2019.
  • (2018) Design considerations for GPU-aware collective communications in MPI. Concurrency and Computation: Practice and Experience 30(17). DOI: 10.1002/cpe.4667. Online publication date: 18 May 2018.
  • (2016) Topology-Aware GPU Selection on Multi-GPU Nodes. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 712-720. DOI: 10.1109/IPDPSW.2016.44. Online publication date: May 2016.
  • (2016) CUDA kernel based collective reduction operations on large-scale GPU clusters. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pages 726-735. DOI: 10.1109/CCGrid.2016.111. Online publication date: 16 May 2016.
  • (2015) Hyper-Q aware intranode MPI collectives on the GPU. Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware, pages 47-50. DOI: 10.1145/2832241.2832247. Online publication date: 15 November 2015.
