
Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13289)

Abstract

As more High-Performance Computing (HPC) and Deep Learning (DL) applications adapt to scale on GPUs, the communication of GPU-resident data is becoming vital to end-to-end application performance. Among the MPI operations used in such applications, All-to-All is one of the most communication-intensive and often becomes the bottleneck when scaling applications to larger GPU systems. Over the last decade, most research has focused on optimizing large GPU-resident data transfers. Nevertheless, in state-of-the-art GPU-aware MPI libraries, MPI_Alltoall communication of large GPU-resident data still suffers from poor performance due to the throughput limitations of commodity networks. The emergence of high-throughput GPU-based compression algorithms makes it possible to reduce the volume of data transferred, and recent research on point-to-point online compression built on these algorithms has shown potential on modern GPU clusters.

In this paper, we redesign an MPI library to enable efficient collective-level online compression with an optimized host-staging scheme for All-to-All communication. We demonstrate that the proposed design achieves benefits at both the microbenchmark and application levels. At the microbenchmark level, the proposed design reduces All-to-All communication latency by up to 87%. For PSDNS, a traditional HPC application, the proposed design reduces All-to-All communication latency and total runtime by up to 29.2% and 21.8%, respectively, while passing data validation and without affecting the application's convergence time. For Microsoft's DeepSpeed, a DL optimization library, the proposed design reduces MPI_Alltoall runtime by up to 26.4% compared to a state-of-the-art MPI library with point-to-point compression, again while passing data validation. To the best of our knowledge, this is the first work that leverages online GPU-based compression techniques to significantly accelerate MPI_Alltoall communication for HPC and DL applications.
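To make the host-staging idea concrete, the following is a minimal sketch (in C with MPI and CUDA) of a compressed, host-staged All-to-All. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the helpers gpu_compress_block() and gpu_decompress_block(), the function name alltoall_gpu_compressed(), and the assumption that compressed blocks never exceed their original size are hypothetical placeholders standing in for a GPU compressor such as ZFP, MPC, or nvCOMP.

/*
 * Hypothetical sketch of a host-staged, compressed MPI_Alltoall for
 * GPU-resident data. NOT the paper's implementation; the stages are
 * serialized here for clarity.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Placeholder GPU compressor: returns the compressed size in bytes. */
size_t gpu_compress_block(const void *d_src, size_t bytes,
                          void *d_dst, cudaStream_t s);
void gpu_decompress_block(const void *d_src, size_t comp_bytes,
                          void *d_dst, size_t bytes, cudaStream_t s);

int alltoall_gpu_compressed(const void *d_send, void *d_recv,
                            size_t block_bytes, MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Device scratch for compressed blocks and host staging buffers
     * (a tuned design would use pinned host memory). */
    void *d_comp, *d_comp_recv;
    cudaMalloc(&d_comp, nprocs * block_bytes);
    cudaMalloc(&d_comp_recv, nprocs * block_bytes);
    char *h_send = malloc(nprocs * block_bytes);
    char *h_recv = malloc(nprocs * block_bytes);

    int *sendcounts = malloc(nprocs * sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *sdispls = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));

    /* 1. Compress each per-destination block on the GPU and stage the
     *    compressed bytes to the host, recording compressed sizes. */
    for (int i = 0; i < nprocs; i++) {
        const char *src = (const char *)d_send + (size_t)i * block_bytes;
        char *dst = (char *)d_comp + (size_t)i * block_bytes;
        sendcounts[i] = (int)gpu_compress_block(src, block_bytes, dst, stream);
        cudaMemcpyAsync(h_send + (size_t)i * block_bytes, dst,
                        sendcounts[i], cudaMemcpyDeviceToHost, stream);
    }
    cudaStreamSynchronize(stream);

    /* 2. Exchange compressed sizes so each rank can post MPI_Alltoallv. */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);
    for (int i = 0; i < nprocs; i++) {
        sdispls[i] = i * (int)block_bytes;  /* blocks stay at fixed offsets */
        rdispls[i] = i * (int)block_bytes;
    }

    /* 3. Exchange only the compressed bytes over the network. */
    MPI_Alltoallv(h_send, sendcounts, sdispls, MPI_BYTE,
                  h_recv, recvcounts, rdispls, MPI_BYTE, comm);

    /* 4. Copy the received compressed blocks back to the GPU and
     *    decompress them into the user's receive buffer. */
    for (int i = 0; i < nprocs; i++) {
        char *dst = (char *)d_comp_recv + (size_t)i * block_bytes;
        cudaMemcpyAsync(dst, h_recv + (size_t)i * block_bytes,
                        recvcounts[i], cudaMemcpyHostToDevice, stream);
        gpu_decompress_block(dst, recvcounts[i],
                             (char *)d_recv + (size_t)i * block_bytes,
                             block_bytes, stream);
    }
    cudaStreamSynchronize(stream);

    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    free(h_send); free(h_recv);
    cudaFree(d_comp); cudaFree(d_comp_recv);
    cudaStreamDestroy(stream);
    return MPI_SUCCESS;
}

An optimized implementation would pipeline and overlap these stages (compression, staging copies, and network transfers) rather than serializing them as above; the sketch is only meant to show where collective-level compression reduces the bytes crossing the network.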

*This research is supported in part by NSF grants #1818253, #1854828, #1931537, #2007991, #2018627, #2112606, and XRAC grant #NCR-130002.



Acknowledgment

The authors would like to thank Kiran Ravikumar and Prof. P.K. Yeung from the Georgia Institute of Technology for their guidance in conducting experiments with the 3D-FFT kernel of the PSDNS application.

Author information

Corresponding author

Correspondence to Qinghua Zhou.



Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhou, Q. et al. (2022). Accelerating MPI All-to-All Communication with Online Compression on Modern GPU Clusters. In: Varbanescu, AL., Bhatele, A., Luszczek, P., Baboulin, M. (eds) High Performance Computing. ISC High Performance 2022. Lecture Notes in Computer Science, vol 13289. Springer, Cham. https://doi.org/10.1007/978-3-031-07312-0_1


  • DOI: https://doi.org/10.1007/978-3-031-07312-0_1


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-07311-3

  • Online ISBN: 978-3-031-07312-0

  • eBook Packages: Computer Science, Computer Science (R0)
