Abstract
Distributed deep learning accelerates neural model training by employing multiple workers across a cluster of nodes to train a neural network in parallel. In this paper, we propose InHAD, an asynchronous distributed deep learning protocol whose key novelty lies in its hierarchical design of gradient communication and aggregation. Local aggregation is conducted inside each computing node to combine the gradients produced by its workers, while global aggregation is conducted at the parameter server to compute new model parameters from the results of local aggregations. To guarantee convergence of the training, we design an iteration number (IN)-based mechanism: worker nodes keep sending their INs to the parameter server, which counts the INs to decide when to pull gradients from the workers and update the global model parameters. With the IN-based hierarchical aggregation technique, InHAD reduces communication cost by cutting the number of gradients transferred and speeds up convergence by bounding the staleness of gradients. We conduct extensive experiments on the Tianhe-2 supercomputer system to evaluate the performance of InHAD. Two neural networks are trained on two classical datasets, and related protocols, Horovod and ASP, are tested for comparison. The results show that InHAD achieves much higher acceleration than ASP and nearly the same accuracy as Horovod.
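The two-level scheme described above can be sketched in a few lines of Python. This is a minimal illustrative model, not the paper's implementation: the class names, the averaging rule, and the staleness bound `STALENESS_BOUND` are assumptions introduced for exposition; the actual InHAD protocol may differ in how INs are counted and when pulls are triggered.

```python
import numpy as np

# Illustrative staleness bound: the largest allowed gap between the
# fastest and slowest node's reported iteration numbers (assumed value).
STALENESS_BOUND = 4


def local_aggregate(worker_grads):
    """Node-level aggregation: average the gradients of local workers."""
    return np.mean(worker_grads, axis=0)


class ParameterServer:
    """Toy parameter server tracking per-node iteration numbers (INs)."""

    def __init__(self, dim, num_nodes, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr
        self.iter_counts = np.zeros(num_nodes, dtype=int)

    def report_in(self, node_id, iteration_number):
        """Workers keep sending their current IN to the server."""
        self.iter_counts[node_id] = iteration_number

    def ready_to_pull(self):
        """Pull gradients only while staleness stays within the bound."""
        return self.iter_counts.max() - self.iter_counts.min() <= STALENESS_BOUND

    def global_aggregate(self, node_grads):
        """Global aggregation: update weights from locally aggregated gradients."""
        self.weights -= self.lr * np.mean(node_grads, axis=0)
        return self.weights


# Usage: two nodes, each contributing one locally aggregated gradient.
ps = ParameterServer(dim=3, num_nodes=2)
ps.report_in(0, 5)
ps.report_in(1, 2)
g = local_aggregate([np.ones(3), 3 * np.ones(3)])  # node-level average -> [2, 2, 2]
if ps.ready_to_pull():
    ps.global_aggregate([g, g])
```

Because only one aggregated gradient per node reaches the server, the number of transferred gradients drops from the number of workers to the number of nodes, which is where the communication saving in the abstract comes from.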
Acknowledgements
This research is partially supported by the National Key Research and Development Program of China (No. 2018YFB0203803), the National Natural Science Foundation of China (U1801266, U1711263), and the Guangdong Natural Science Foundation (2018B030312002).
Cite this article
Xiao, D., Li, X., Zhou, J. et al. Iteration number-based hierarchical gradient aggregation for distributed deep learning. J Supercomput 78, 5565–5587 (2022). https://doi.org/10.1007/s11227-021-04083-x