DOI: 10.1145/2642769.2642773
Research article

GPU-Aware Intranode MPI_Allreduce

Published: 09 September 2014

Abstract

Modern multi-core clusters are increasingly using GPUs to achieve higher performance and power efficiency. In such clusters, efficient communication among processes whose data reside in GPU memory is of paramount importance to the performance of MPI applications. This paper investigates the efficient design of the intranode MPI_Allreduce operation in GPU clusters. We propose two design alternatives that exploit in-GPU reduction and the fast intranode communication capabilities of modern GPUs. Our GPU shared-buffer aware design and our GPU-aware binomial reduce-broadcast algorithmic approach deliver speedups over MVAPICH2 of up to 22x and 16x, respectively.
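
To make the binomial reduce-broadcast pattern concrete, the sketch below shows the same algorithmic structure on host buffers using plain MPI point-to-point calls. It is a minimal illustration, not the paper's implementation: the function name binomial_allreduce_sum is ours, and in the paper's GPU-aware designs the element-wise addition would instead run as an in-GPU reduction kernel, with data movement going through CUDA IPC or shared staging buffers between processes on the same node rather than host-side sends and receives.

    /* Hedged sketch (not the paper's code): a binomial-tree reduction toward
       rank 0 followed by a broadcast, which together form an allreduce. */
    #include <mpi.h>
    #include <stdlib.h>

    static void binomial_allreduce_sum(float *buf, int count, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        float *tmp = (float *)malloc((size_t)count * sizeof(float));

        /* Reduce phase: combine partial results up a binomial tree toward rank 0. */
        for (int mask = 1; mask < size; mask <<= 1) {
            if (rank & mask) {
                /* The lower-numbered partner owns the combined result from here on. */
                MPI_Send(buf, count, MPI_FLOAT, rank - mask, 0, comm);
                break;
            } else if (rank + mask < size) {
                MPI_Recv(tmp, count, MPI_FLOAT, rank + mask, 0, comm, MPI_STATUS_IGNORE);
                for (int i = 0; i < count; i++)
                    buf[i] += tmp[i];   /* an in-GPU reduction kernel in the GPU-aware variant */
            }
        }

        /* Broadcast phase: distribute the fully reduced result from rank 0. */
        MPI_Bcast(buf, count, MPI_FLOAT, 0, comm);

        free(tmp);
    }

The sketch captures only the communication structure; invoking an allreduce directly on device pointers, as the paper's designs do, additionally requires a GPU-aware MPI library such as MVAPICH2.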




Information

Published In

EuroMPI/ASIA '14: Proceedings of the 21st European MPI Users' Group Meeting
September 2014
183 pages
ISBN: 9781450328753
DOI: 10.1145/2642769
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • Kyoto University
  • University of Tokyo
  • University of Tsukuba

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 September 2014

Author Tags

  1. GPU
  2. IPC
  3. Intranode MPI_Allreduce
  4. MPI
  5. Shared Buffer

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI/ASIA '14

Acceptance Rates

EuroMPI/ASIA '14 Paper Acceptance Rate 18 of 39 submissions, 46%;
Overall Acceptance Rate 18 of 39 submissions, 46%


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 3
Reflects downloads up to 17 Jan 2025

Cited By

  • (2022) Accelerating Deep Learning Using Interconnect-Aware UCX Communication for MPI Collectives. IEEE Micro 42(2):68-76. DOI: 10.1109/MM.2022.3148670. Online publication date: 1 March 2022.
  • (2021) Efficient Multi-Path NVLink/PCIe-Aware UCX based Collective Communication for Deep Learning. 2021 IEEE Symposium on High-Performance Interconnects (HOTI), pages 25-34. DOI: 10.1109/HOTI52880.2021.00018. Online publication date: August 2021.
  • (2019) Node-Aware Improvements to Allreduce. 2019 IEEE/ACM Workshop on Exascale MPI (ExaMPI), pages 19-28. DOI: 10.1109/ExaMPI49596.2019.00008. Online publication date: November 2019.
  • (2018) Design considerations for GPU-aware collective communications in MPI. Concurrency and Computation: Practice and Experience 30(17). DOI: 10.1002/cpe.4667. Online publication date: 18 May 2018.
  • (2016) Topology-Aware GPU Selection on Multi-GPU Nodes. 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 712-720. DOI: 10.1109/IPDPSW.2016.44. Online publication date: May 2016.
  • (2016) CUDA kernel based collective reduction operations on large-scale GPU clusters. Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, pages 726-735. DOI: 10.1109/CCGrid.2016.111. Online publication date: 16 May 2016.
  • (2015) Hyper-Q aware intranode MPI collectives on the GPU. Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware, pages 47-50. DOI: 10.1145/2832241.2832247. Online publication date: 15 November 2015.
