Iteration number-based hierarchical gradient aggregation for distributed deep learning

The Journal of Supercomputing

Abstract

Distributed deep learning can effectively accelerate neural model training by employing multiple workers across a cluster of nodes to train a neural network in parallel. In this paper, we propose InHAD, an asynchronous distributed deep learning protocol whose key novelty lies in its hierarchical gradient communication and aggregation. Local aggregation is conducted inside each computing node to combine the gradients produced by its local workers, while global aggregation is conducted at the parameter server to compute new model parameters from the results of the local aggregations. An iteration number (IN) based mechanism is designed to guarantee the convergence of training: worker nodes keep sending their INs to the parameter server, which counts them to decide when to pull gradients from the workers and update the global model parameters. With this IN-based hierarchical aggregation technique, InHAD saves communication cost by reducing the number of gradients transferred and speeds up convergence by limiting the staleness of gradients. We conduct extensive experiments on the Tianhe-2 supercomputer system to evaluate the performance of InHAD. Two neural networks are trained on two classical datasets, and similar protocols, such as Horovod and ASP, are tested for comparison. The results show that InHAD achieves much higher acceleration than ASP and nearly the same accuracy as Horovod.
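
The protocol just outlined can be made concrete with a short sketch. The following Python fragment illustrates the two-level flow described in the abstract: local averaging of worker gradients inside a node, iteration number (IN) reports to the parameter server, and a staleness-bounded global update. All names (local_aggregate, ParameterServer, STALENESS_BOUND) and the specific update rule are assumptions made for exposition; this is not the authors' implementation.

    # Minimal sketch of IN-based hierarchical aggregation (illustrative only).
    import numpy as np

    STALENESS_BOUND = 4   # assumed limit on how stale a node's gradients may be

    def local_aggregate(worker_grads):
        # Local aggregation: average the gradients produced by the workers
        # running inside one computing node before anything leaves the node.
        return np.mean(worker_grads, axis=0)

    class ParameterServer:
        # Global aggregation: nodes keep reporting their iteration numbers
        # (INs); the server counts them, pulls locally aggregated gradients
        # only from nodes whose IN is within the staleness bound, and then
        # updates the global model parameters.

        def __init__(self, init_params, lr=0.01):
            self.params = init_params
            self.lr = lr
            self.global_iter = 0
            self.reported_in = {}      # node_id -> latest reported IN

        def report_in(self, node_id, iteration_number):
            # Worker nodes push only their IN here, not gradients.
            self.reported_in[node_id] = iteration_number

        def fresh_nodes(self):
            # Nodes whose progress is within the staleness bound of the
            # current global iteration are eligible for a gradient pull.
            return [n for n, it in self.reported_in.items()
                    if self.global_iter - it <= STALENESS_BOUND]

        def global_update(self, node_grads):
            # node_grads: locally aggregated gradients pulled from fresh nodes.
            grad = np.mean(node_grads, axis=0)
            self.params = self.params - self.lr * grad
            self.global_iter += 1
            return self.params

In this sketch, each node calls local_aggregate on its workers' gradients and report_in on the server; once enough fresh INs have been counted, the server pulls the corresponding local aggregates and applies global_update, which bounds the staleness of any gradient entering the global model.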

Notes

  1. https://github.com/junyuseu/uncommon-datasets-caffe.

  2. https://www.nscc-gz.cn.

  3. https://github.com/junyuseu/uncommon-datasets-caffe.

Acknowledgements

This research is partially supported by the National Key Research and Development Program of China (No. 2018YFB0203803), the National Natural Science Foundation of China (Nos. U1801266 and U1711263), and the Guangdong Natural Science Foundation (No. 2018B030312002).

Author information

Corresponding author

Correspondence to Weigang Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xiao, D., Li, X., Zhou, J. et al. Iteration number-based hierarchical gradient aggregation for distributed deep learning. J Supercomput 78, 5565–5587 (2022). https://doi.org/10.1007/s11227-021-04083-x
