Abstract
Distributed deep learning accelerates neural model training by employing multiple workers across a cluster of nodes to train a neural network in parallel. In this paper, we propose InHAD, an asynchronous distributed deep learning protocol whose key novelty lies in its hierarchical design of gradient communication and aggregation. Local aggregation is conducted inside each computing node to combine the gradients produced by its workers, while global aggregation is conducted at the parameter server to compute new model parameters from the results of local aggregations. To guarantee convergence of the training, we design an iteration number (IN)-based mechanism: worker nodes keep sending their INs to the parameter server, which counts the INs to decide when to pull gradients from the workers and update the global model parameters. With the IN-based hierarchical aggregation technique, InHAD reduces communication cost by cutting the number of gradients transferred and speeds up convergence by bounding the staleness of gradients. We conduct extensive experiments on the Tianhe-2 supercomputer system to evaluate the performance of InHAD. Two neural networks are trained on two classical datasets, and related protocols, Horovod and ASP, are tested for comparison. The results show that InHAD achieves much higher acceleration than ASP and nearly the same accuracy as Horovod.
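The two-level scheme described above can be sketched in a few lines of Python. This is a minimal illustrative model, not the paper's implementation: the class names, the averaging rule, and the staleness bound `STALENESS_BOUND` are assumptions introduced for exposition; the actual InHAD protocol may differ in how INs are counted and when pulls are triggered.

```python
import numpy as np

# Illustrative staleness bound: the largest allowed gap between the
# fastest and slowest node's reported iteration numbers (assumed value).
STALENESS_BOUND = 4


def local_aggregate(worker_grads):
    """Node-level aggregation: average the gradients of local workers."""
    return np.mean(worker_grads, axis=0)


class ParameterServer:
    """Toy parameter server tracking per-node iteration numbers (INs)."""

    def __init__(self, dim, num_nodes, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr
        self.iter_counts = np.zeros(num_nodes, dtype=int)

    def report_in(self, node_id, iteration_number):
        """Workers keep sending their current IN to the server."""
        self.iter_counts[node_id] = iteration_number

    def ready_to_pull(self):
        """Pull gradients only while staleness stays within the bound."""
        return self.iter_counts.max() - self.iter_counts.min() <= STALENESS_BOUND

    def global_aggregate(self, node_grads):
        """Global aggregation: update weights from locally aggregated gradients."""
        self.weights -= self.lr * np.mean(node_grads, axis=0)
        return self.weights


# Usage: two nodes, each contributing one locally aggregated gradient.
ps = ParameterServer(dim=3, num_nodes=2)
ps.report_in(0, 5)
ps.report_in(1, 2)
g = local_aggregate([np.ones(3), 3 * np.ones(3)])  # node-level average -> [2, 2, 2]
if ps.ready_to_pull():
    ps.global_aggregate([g, g])
```

Because only one aggregated gradient per node reaches the server, the number of transferred gradients drops from the number of workers to the number of nodes, which is where the communication saving in the abstract comes from.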
Acknowledgements
This research is partially supported by the National Key Research and Development Program of China (No. 2018YFB0203803), the National Natural Science Foundation of China (U1801266, U1711263), and the Guangdong Natural Science Foundation (2018B030312002).
Cite this article
Xiao, D., Li, X., Zhou, J. et al. Iteration number-based hierarchical gradient aggregation for distributed deep learning. J Supercomput 78, 5565–5587 (2022). https://doi.org/10.1007/s11227-021-04083-x