Abstract
Model synchronization is the communication process by which workers in a large-scale distributed machine learning task exchange model parameters. As the cluster scales up, synchronizing these parameters across thousands of workers becomes a challenging coordination problem. This study first proposes a hierarchical AllReduce algorithm structured on a two-dimensional torus (2D-THA), which uses a hierarchical scheme to synchronize model parameters while maximizing bandwidth utilization. Second, it introduces 2D-THA-ADMM, a distributed consensus algorithm that combines the 2D-THA synchronization scheme with the alternating direction method of multipliers (ADMM). Third, we evaluate the parameter synchronization performance of 2D-THA and the scalability of 2D-THA-ADMM on the Tianhe-2 supercomputing platform using real public datasets. Our experiments show that 2D-THA reduces synchronization time by \(63.447\%\) compared with MPI_Allreduce. Furthermore, 2D-THA-ADMM exhibits excellent scalability, training more than 3\(\times \) faster than state-of-the-art methods while maintaining high accuracy and computational efficiency.
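The two-stage synchronization idea behind 2D-THA can be illustrated with a minimal sketch, assuming an mpi4py/NumPy environment and a world size that factors into a rows \(\times\) cols process grid. The function name `torus_hierarchical_allreduce`, its `rows`/`cols` parameters, and the use of MPI's built-in Allreduce inside each stage are illustrative assumptions for brevity, not the authors' implementation, which segments and pipelines the data over the torus links.

```python
# Hypothetical sketch of a hierarchical AllReduce over a 2-D process grid,
# in the spirit of 2D-THA (not the authors' code).
from mpi4py import MPI
import numpy as np

def torus_hierarchical_allreduce(local_grad: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Sum `local_grad` over a rows x cols process grid in two stages:
    first within each row, then across columns."""
    world = MPI.COMM_WORLD
    assert world.Get_size() == rows * cols, "world size must equal rows * cols"

    # Build a 2-D Cartesian communicator with wrap-around (torus) links,
    # then split it into per-row and per-column sub-communicators.
    cart = world.Create_cart(dims=[rows, cols], periods=[True, True])
    row_comm = cart.Sub([False, True])   # spans one row of the grid
    col_comm = cart.Sub([True, False])   # spans one column of the grid

    # Stage 1: reduce within each row.
    row_sum = np.empty_like(local_grad)
    row_comm.Allreduce(local_grad, row_sum, op=MPI.SUM)

    # Stage 2: reduce the row results across columns; afterwards every
    # rank holds the global sum.
    global_sum = np.empty_like(local_grad)
    col_comm.Allreduce(row_sum, global_sum, op=MPI.SUM)
    return global_sum

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    grad = np.full(4, comm.Get_rank(), dtype=np.float64)  # toy "gradient"
    total = torus_hierarchical_allreduce(grad, rows=2, cols=comm.Get_size() // 2)
    print(comm.Get_rank(), total)
```

In the full algorithm, each stage would be realized with segmented ring reduce-scatter/allgather steps along the torus links rather than a monolithic Allreduce, and the synchronized result would drive the global consensus-variable update of the ADMM iteration in 2D-THA-ADMM.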

Data availability
The data underlying this article are available in the article.