Abstract
As more High-Performance Computing (HPC) and Deep Learning (DL) applications scale out on GPUs, the communication of GPU-resident data is becoming vital to end-to-end application performance. Among the MPI operations such applications use, All-to-All is one of the most communication-intensive and often becomes the bottleneck when scaling to larger GPU systems. Over the last decade, most research has focused on optimizing large GPU-resident data transfers. Yet in state-of-the-art GPU-Aware MPI libraries, MPI_Alltoall communication of large GPU-resident data still performs poorly because of the throughput limitations of commodity networks. GPU-based compression algorithms with high throughput can reduce the volume of data transferred, and recent research on point-to-point online compression with such algorithms has shown promise on modern GPU clusters.
In this paper, we redesign an MPI library to enable efficient collective-level online compression with an optimized host-staging scheme for All-to-All communication. We demonstrate that the proposed design yields benefits at both the microbenchmark and application levels. At the microbenchmark level, it reduces All-to-All communication latency by up to 87%. For PSDNS, a traditional HPC application, it reduces All-to-All communication latency and total runtime by up to 29.2% and 21.8%, respectively, while preserving data validity and leaving the application's convergence time unaffected. For Microsoft's DeepSpeed, a DL optimization library, it reduces MPI_Alltoall runtime by up to 26.4% compared to a state-of-the-art MPI library with point-to-point compression, again while preserving data validity. To the best of our knowledge, this is the first work that leverages online GPU-based compression to significantly accelerate MPI_Alltoall communication for HPC and DL applications.
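To make the data-flow idea concrete, the following is a minimal, hypothetical sketch of an All-to-All exchange with online compression. It is not the paper's implementation: `zlib` stands in for a high-throughput GPU compressor (e.g., MPC or ZFP), nested Python lists stand in for MPI ranks, and the host-staging pipeline, CUDA streams, and actual MPI calls are all omitted. The function name and structure are illustrative assumptions only.

```python
import zlib

def alltoall_with_compression(send_blocks):
    """send_blocks[src][dst] is the bytes rank `src` sends to rank `dst`.
    Each block is compressed before the exchange and decompressed on
    receipt, mimicking the reduced on-the-wire volume of online
    compression. Returns recv_blocks[dst][src]."""
    nranks = len(send_blocks)
    # Compress every outgoing block (on a real system: on the GPU).
    compressed = [[zlib.compress(blk) for blk in row] for row in send_blocks]
    # The exchange itself: rank dst receives block [src][dst] from each src.
    exchanged = [[compressed[src][dst] for src in range(nranks)]
                 for dst in range(nranks)]
    # Decompress incoming blocks (on a real system: on the GPU).
    return [[zlib.decompress(blk) for blk in row] for row in exchanged]

if __name__ == "__main__":
    nranks = 4
    # Highly compressible payloads, as is common for scientific data.
    send = [[bytes([src]) * 1024 for _ in range(nranks)]
            for src in range(nranks)]
    recv = alltoall_with_compression(send)
    # Correctness: rank dst's block from src equals what src sent to dst.
    assert all(recv[dst][src] == send[src][dst]
               for src in range(nranks) for dst in range(nranks))
    wire = sum(len(zlib.compress(b)) for row in send for b in row)
    raw = sum(len(b) for row in send for b in row)
    print(f"on-the-wire bytes: {wire} vs raw: {raw}")
```

The key property the sketch exercises is that compression is applied per block inside the collective (rather than once per point-to-point message), so what crosses the network is the compressed payload while each receiver still reconstructs the exact bytes it was sent.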
*This research is supported in part by NSF grants #1818253, #1854828, #1931537, #2007991, #2018627, #2112606, and XRAC grant #NCR-130002.
Acknowledgment
The authors would like to thank Kiran Ravikumar and Prof. P.K. Yeung of the Georgia Institute of Technology for their guidance in conducting experiments with the 3D-FFT kernel of the PSDNS application.
© 2022 Springer Nature Switzerland AG
Cite this paper
Zhou, Q. et al. (2022). Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters. In: Varbanescu, AL., Bhatele, A., Luszczek, P., Marc, B. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_1
DOI: https://doi.org/10.1007/978-3-031-07312-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07311-3
Online ISBN: 978-3-031-07312-0