ADAPT: an event-based adaptive collective communication framework

Published: 11 June 2018

Abstract

The increasing scale and heterogeneity of high-performance computing (HPC) systems make the performance of Message Passing Interface (MPI) collective communications susceptible to noise and require them to adapt to a complex mix of hardware capabilities. State-of-the-art MPI collective designs rely heavily on synchronizations, which magnify noise across the participating processes and result in significant performance slowdown. This design philosophy must therefore be reconsidered to run efficiently and robustly on large-scale heterogeneous platforms. In this paper, we present ADAPT, a new collective communication framework in Open MPI that uses event-driven techniques to morph collective algorithms to heterogeneous environments. The core concept of ADAPT is to relax synchronizations while maintaining the minimal data dependencies of MPI collectives. To fully exploit the different bandwidths of the data movement lanes in heterogeneous systems, we extend the ADAPT collective framework with a topology-aware communication tree, which removes the boundaries between different hardware topologies while maximizing the speed of data movement. We evaluate our framework with two popular collective operations, broadcast and reduce, on both CPU and GPU clusters. Our results demonstrate drastic performance improvements and a strong resistance to noise compared to other state-of-the-art MPI libraries. In particular, we demonstrate at least 1.3X and 1.5X speedup for CPU data and 2X and 10X speedup for GPU data using the ADAPT event-based broadcast and reduce operations.

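To make the event-driven idea concrete, the sketch below shows one way the pattern described in the abstract can be expressed with plain MPI point-to-point operations: a segmented broadcast over a binary tree in which every rank forwards a segment to its children as soon as that segment's receive completes, instead of synchronizing on the whole message. This is only an illustrative sketch, not ADAPT's actual implementation (ADAPT is built inside Open MPI's collective framework and reacts to events from the library's progress engine); the segment size, tag base, binary-tree shape, and the helper name event_driven_bcast are assumptions made for this example.

    /* Illustrative sketch only (assumed, not the ADAPT code): a segmented,
     * event-driven broadcast over a binary tree.  Each rank forwards a
     * segment as soon as that segment's receive completes, so no process
     * waits for the whole message or for a global synchronization. */
    #include <mpi.h>
    #include <stdlib.h>

    #define SEG_SIZE 65536   /* bytes per pipeline segment (arbitrary choice) */
    #define TAG_BASE 1001    /* arbitrary tag base for this example           */

    /* Hypothetical helper, not part of any MPI library. */
    void event_driven_bcast(char *buf, int nbytes, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Binary broadcast tree, with ranks renumbered so the root is vrank 0. */
        int vrank  = (rank - root + size) % size;
        int parent = (vrank == 0) ? MPI_PROC_NULL
                                  : ((vrank - 1) / 2 + root) % size;
        int child[2], nchild = 0;
        for (int c = 2 * vrank + 1; c <= 2 * vrank + 2; c++)
            if (c < size) child[nchild++] = (c + root) % size;

        int nseg = (nbytes + SEG_SIZE - 1) / SEG_SIZE;
        MPI_Request *rreq = malloc(nseg * sizeof(MPI_Request));
        MPI_Request *sreq = malloc((size_t)nseg * 2 * sizeof(MPI_Request));
        int nsend = 0;

        /* Non-root ranks post one receive per segment up front; the completion
         * of each receive is the "event" that releases that segment downstream. */
        for (int i = 0; i < nseg; i++) {
            int off = i * SEG_SIZE;
            int len = (off + SEG_SIZE <= nbytes) ? SEG_SIZE : nbytes - off;
            if (vrank != 0)
                MPI_Irecv(buf + off, len, MPI_BYTE, parent, TAG_BASE + i,
                          comm, &rreq[i]);
        }

        if (vrank == 0) {
            /* The root already owns the data: stream all segments immediately. */
            for (int i = 0; i < nseg; i++) {
                int off = i * SEG_SIZE;
                int len = (off + SEG_SIZE <= nbytes) ? SEG_SIZE : nbytes - off;
                for (int k = 0; k < nchild; k++)
                    MPI_Isend(buf + off, len, MPI_BYTE, child[k], TAG_BASE + i,
                              comm, &sreq[nsend++]);
            }
        } else {
            /* Other ranks react to whichever segment happens to arrive first. */
            for (int done = 0; done < nseg; done++) {
                int i;
                MPI_Waitany(nseg, rreq, &i, MPI_STATUS_IGNORE);
                int off = i * SEG_SIZE;
                int len = (off + SEG_SIZE <= nbytes) ? SEG_SIZE : nbytes - off;
                for (int k = 0; k < nchild; k++)
                    MPI_Isend(buf + off, len, MPI_BYTE, child[k], TAG_BASE + i,
                              comm, &sreq[nsend++]);
            }
        }

        if (nsend > 0)
            MPI_Waitall(nsend, sreq, MPI_STATUSES_IGNORE);
        free(rreq);
        free(sreq);
    }

Every rank of the communicator would call event_driven_bcast(buf, nbytes, root, comm) in place of a blocking byte-wise MPI_Bcast. Because a noise-delayed process only stalls the segments that actually depend on it, the pipeline keeps moving on the rest of the tree, which is the kind of relaxed synchronization the abstract argues for.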

Published In

HPDC '18: Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing
June 2018
291 pages
ISBN:9781450357852
DOI:10.1145/3208040
© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. MPI
  3. collective operations
  4. event-driven
  5. heterogeneous system
  6. system noise

Qualifiers

  • Research-article

Conference

HPDC '18

Acceptance Rates

HPDC '18 paper acceptance rate: 22 of 111 submissions (20%)
Overall acceptance rate: 166 of 966 submissions (17%)

Cited By

  • (2024) Network states-aware collective communication optimization. Cluster Computing 27(5), 6869-6887. DOI: 10.1007/s10586-024-04330-9. Online publication date: 1-Aug-2024.
  • (2023) Synchronizing MPI Processes in Space and Time. Proceedings of the 30th European MPI Users' Group Meeting, 1-11. DOI: 10.1145/3615318.3615325. Online publication date: 11-Sep-2023.
  • (2021) Designing High-Performance MPI Libraries with On-the-fly Compression for Modern GPU Clusters. 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 444-453. DOI: 10.1109/IPDPS49936.2021.00053. Online publication date: May-2021.
  • (2021) Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems. 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 113-122. DOI: 10.1109/CCGrid51090.2021.00021. Online publication date: May-2021.
  • (2020) Using Advanced Vector Extensions AVX-512 for MPI Reductions. Proceedings of the 27th European MPI Users' Group Meeting, 1-10. DOI: 10.1145/3416315.3416316. Online publication date: 21-Sep-2020.
  • (2020) NV-group. Proceedings of the 34th ACM International Conference on Supercomputing, 1-12. DOI: 10.1145/3392717.3392771. Online publication date: 29-Jun-2020.
  • (2020) Dynamic Kernel Fusion for Bulk Non-contiguous Data Transfer on GPU Clusters. 2020 IEEE International Conference on Cluster Computing (CLUSTER), 130-141. DOI: 10.1109/CLUSTER49012.2020.00023. Online publication date: Sep-2020.
  • (2020) HAN: a Hierarchical AutotuNed Collective Communication Framework. 2020 IEEE International Conference on Cluster Computing (CLUSTER), 23-34. DOI: 10.1109/CLUSTER49012.2020.00013. Online publication date: Sep-2020.
  • (2020) Using Arm Scalable Vector Extension to Optimize OPEN MPI. 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), 222-231. DOI: 10.1109/CCGrid49817.2020.00-71. Online publication date: May-2020.
  • (2020) Shared Memory Based MPI Broadcast Algorithms for NUMA Systems. Supercomputing, 473-485. DOI: 10.1007/978-3-030-64616-5_41. Online publication date: 4-Dec-2020.
