ADAPT: an event-based adaptive collective communication framework

Published: 11 June 2018

Abstract

The increasing scale and heterogeneity of high-performance computing (HPC) systems make the performance of Message Passing Interface (MPI) collective communications susceptible to noise and require them to adapt to a complex mix of hardware capabilities. State-of-the-art MPI collective designs rely heavily on synchronizations, which magnify noise across the participating processes and result in significant performance slowdown. This design philosophy must therefore be reconsidered to run efficiently and robustly on large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI that uses event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of the data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree, which removes the boundaries between different hardware topologies while maximizing the speed of data movement. We evaluate our framework with two popular collective operations, broadcast and reduce, on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and a strong resistance to noise compared to other state-of-the-art MPI libraries. In particular, we demonstrate at least 1.3X and 1.5X speedup for CPU data and 2X and 10X speedup for GPU data using the ADAPT event-based broadcast and reduce operations.

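To make the event-driven idea concrete, the sketch below shows one way the pattern described in the abstract can be expressed with plain MPI point-to-point operations: a segmented broadcast over a binary tree in which every rank forwards a segment to its children as soon as that segment's receive completes, instead of synchronizing on the whole message. This is only an illustrative sketch, not ADAPT's actual implementation (ADAPT is built inside Open MPI's collective framework and reacts to events from the library's progress engine); the segment size, tag base, binary-tree shape, and the helper name event_driven_bcast are assumptions made for this example.

    /* Illustrative sketch only (assumed, not the ADAPT code): a segmented,
     * event-driven broadcast over a binary tree.  Each rank forwards a
     * segment as soon as that segment's receive completes, so no process
     * waits for the whole message or for a global synchronization. */
    #include <mpi.h>
    #include <stdlib.h>

    #define SEG_SIZE 65536   /* bytes per pipeline segment (arbitrary choice) */
    #define TAG_BASE 1001    /* arbitrary tag base for this example           */

    /* Hypothetical helper, not part of any MPI library. */
    void event_driven_bcast(char *buf, int nbytes, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Binary broadcast tree, with ranks renumbered so the root is vrank 0. */
        int vrank  = (rank - root + size) % size;
        int parent = (vrank == 0) ? MPI_PROC_NULL
                                  : ((vrank - 1) / 2 + root) % size;
        int child[2], nchild = 0;
        for (int c = 2 * vrank + 1; c <= 2 * vrank + 2; c++)
            if (c < size) child[nchild++] = (c + root) % size;

        int nseg = (nbytes + SEG_SIZE - 1) / SEG_SIZE;
        MPI_Request *rreq = malloc(nseg * sizeof(MPI_Request));
        MPI_Request *sreq = malloc((size_t)nseg * 2 * sizeof(MPI_Request));
        int nsend = 0;

        /* Non-root ranks post one receive per segment up front; the completion
         * of each receive is the "event" that releases that segment downstream. */
        for (int i = 0; i < nseg; i++) {
            int off = i * SEG_SIZE;
            int len = (off + SEG_SIZE <= nbytes) ? SEG_SIZE : nbytes - off;
            if (vrank != 0)
                MPI_Irecv(buf + off, len, MPI_BYTE, parent, TAG_BASE + i,
                          comm, &rreq[i]);
        }

        if (vrank == 0) {
            /* The root already owns the data: stream all segments immediately. */
            for (int i = 0; i < nseg; i++) {
                int off = i * SEG_SIZE;
                int len = (off + SEG_SIZE <= nbytes) ? SEG_SIZE : nbytes - off;
                for (int k = 0; k < nchild; k++)
                    MPI_Isend(buf + off, len, MPI_BYTE, child[k], TAG_BASE + i,
                              comm, &sreq[nsend++]);
            }
        } else {
            /* Other ranks react to whichever segment happens to arrive first. */
            for (int done = 0; done < nseg; done++) {
                int i;
                MPI_Waitany(nseg, rreq, &i, MPI_STATUS_IGNORE);
                int off = i * SEG_SIZE;
                int len = (off + SEG_SIZE <= nbytes) ? SEG_SIZE : nbytes - off;
                for (int k = 0; k < nchild; k++)
                    MPI_Isend(buf + off, len, MPI_BYTE, child[k], TAG_BASE + i,
                              comm, &sreq[nsend++]);
            }
        }

        if (nsend > 0)
            MPI_Waitall(nsend, sreq, MPI_STATUSES_IGNORE);
        free(rreq);
        free(sreq);
    }

Every rank of the communicator would call event_driven_bcast(buf, nbytes, root, comm) in place of a blocking byte-wise MPI_Bcast. Because a noise-delayed process only stalls the segments that actually depend on it, the pipeline keeps moving on the rest of the tree, which is the kind of relaxed synchronization the abstract argues for.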

Published In

HPDC '18: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing
June 2018
291 pages
ISBN:9781450357852
DOI:10.1145/3208040
© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. MPI
  3. collective operations
  4. event-driven
  5. heterogeneous system
  6. system noise

Qualifiers

  • Research-article

Conference

HPDC '18

Acceptance Rates

HPDC '18 paper acceptance rate: 22 of 111 submissions (20%)
Overall acceptance rate: 166 of 966 submissions (17%)

Cited By

  • (2024) Network states-aware collective communication optimization. Cluster Computing 27(5), 6869-6887. DOI: 10.1007/s10586-024-04330-9. Online publication date: 1-Aug-2024.
  • (2023) Synchronizing MPI Processes in Space and Time. Proceedings of the 30th European MPI Users' Group Meeting, 1-11. DOI: 10.1145/3615318.3615325. Online publication date: 11-Sep-2023.
  • (2021) Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 444-453. DOI: 10.1109/IPDPS49936.2021.00053. Online publication date: May-2021.
  • (2021) Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems. 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 113-122. DOI: 10.1109/CCGrid51090.2021.00021. Online publication date: May-2021.
  • (2020) Using Advanced Vector Extensions AVX-512 for MPI Reductions. Proceedings of the 27th European MPI Users' Group Meeting, 1-10. DOI: 10.1145/3416315.3416316. Online publication date: 21-Sep-2020.
  • (2020) NV-group. Proceedings of the 34th ACM International Conference on Supercomputing, 1-12. DOI: 10.1145/3392717.3392771. Online publication date: 29-Jun-2020.
  • (2020) Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters. 2020 IEEE International Conference on Cluster Computing (CLUSTER), 130-141. DOI: 10.1109/CLUSTER49012.2020.00023. Online publication date: Sep-2020.
  • (2020) HAN: a Hierarchical AutotuNed Collective Communication Framework. 2020 IEEE International Conference on Cluster Computing (CLUSTER), 23-34. DOI: 10.1109/CLUSTER49012.2020.00013. Online publication date: Sep-2020.
  • (2020) Using Arm Scalable Vector Extension to Optimize OPEN MPI. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), 222-231. DOI: 10.1109/CCGrid49817.2020.00-71. Online publication date: May-2020.
  • (2020) Shared Memory Based MPI Broadcast Algorithms for NUMA Systems. Supercomputing, 473-485. DOI: 10.1007/978-3-030-64616-5_41. Online publication date: 4-Dec-2020.
