Abstract
Model synchronization is the communication process by which workers in a large-scale distributed machine learning task exchange model parameters. As the cluster scales up, synchronizing these parameters across thousands of workers becomes a challenging coordination problem. This study first proposes a hierarchical AllReduce algorithm structured on a two-dimensional torus (2D-THA), which uses a hierarchical scheme to synchronize model parameters while maximizing bandwidth utilization. Second, it introduces 2D-THA-ADMM, a distributed consensus algorithm that combines the 2D-THA synchronization scheme with the alternating direction method of multipliers (ADMM). Third, we evaluate the parameter synchronization performance of 2D-THA and the scalability of 2D-THA-ADMM on the Tianhe-2 supercomputing platform using real public datasets. Our experiments show that 2D-THA reduces synchronization time by \(63.447\%\) compared with MPI_Allreduce. Furthermore, 2D-THA-ADMM exhibits excellent scalability, training more than 3\(\times \) faster than state-of-the-art methods while maintaining high accuracy and computational efficiency.
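The two-stage synchronization idea behind 2D-THA can be illustrated with a minimal sketch, assuming an mpi4py/NumPy environment and a world size that factors into a rows \(\times\) cols process grid. The function name `torus_hierarchical_allreduce`, its `rows`/`cols` parameters, and the use of MPI's built-in Allreduce inside each stage are illustrative assumptions for brevity, not the authors' implementation, which segments and pipelines the data over the torus links.

```python
# Hypothetical sketch of a hierarchical AllReduce over a 2-D process grid,
# in the spirit of 2D-THA (not the authors' code).
from mpi4py import MPI
import numpy as np

def torus_hierarchical_allreduce(local_grad: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Sum `local_grad` over a rows x cols process grid in two stages:
    first within each row, then across columns."""
    world = MPI.COMM_WORLD
    assert world.Get_size() == rows * cols, "world size must equal rows * cols"

    # Build a 2-D Cartesian communicator with wrap-around (torus) links,
    # then split it into per-row and per-column sub-communicators.
    cart = world.Create_cart(dims=[rows, cols], periods=[True, True])
    row_comm = cart.Sub([False, True])   # spans one row of the grid
    col_comm = cart.Sub([True, False])   # spans one column of the grid

    # Stage 1: reduce within each row.
    row_sum = np.empty_like(local_grad)
    row_comm.Allreduce(local_grad, row_sum, op=MPI.SUM)

    # Stage 2: reduce the row results across columns; afterwards every
    # rank holds the global sum.
    global_sum = np.empty_like(local_grad)
    col_comm.Allreduce(row_sum, global_sum, op=MPI.SUM)
    return global_sum

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    grad = np.full(4, comm.Get_rank(), dtype=np.float64)  # toy "gradient"
    total = torus_hierarchical_allreduce(grad, rows=2, cols=comm.Get_size() // 2)
    print(comm.Get_rank(), total)
```

In the full algorithm, each stage would be realized with segmented ring reduce-scatter/allgather steps along the torus links rather than a monolithic Allreduce, and the synchronized result would drive the global consensus-variable update of the ADMM iteration in 2D-THA-ADMM.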

Data availability
The data underlying this article are available in the article.