
Efficient DNN training based on backpropagation parallelization

  • Regular Paper

Abstract

Pipeline parallelism is an efficient way to speed up the training of deep neural networks (DNNs) by partitioning the model and pipelining the training process across a cluster of workers in a distributed system. In this paper, we propose a new pipeline parallelization approach (Q-FB pipeline) for distributed deep learning that achieves both high training speed and high hardware utilization. The major novelty of Q-FB pipeline lies in a mechanism that parallelizes backpropagation training without loss of precision. Since the parameter update in the backward phase depends on the error computed in the forward phase, naively parallelizing the backpropagation process hurts the model's convergence behaviour. To provide convergence guarantees, Q-FB pipeline lets the forward phase and the backward phase execute in parallel on different processors, using shared model memory and accumulated gradient updates. To overcome the communication bottleneck, Q-FB pipeline compresses both activations and gradients before transferring them to other workers. We adopt an activation quantization scheme to reduce traffic in the forward phase and propose a gradient compression algorithm (2-Step GC) to reduce communication costs in the backward phase. Experiments on both small and large computing clusters (e.g., the Tianhe-2 supercomputer) show that Q-FB pipeline effectively accelerates the training process without loss of convergence or precision.
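
The abstract only names the key ingredients, so the sketch below illustrates two of them in isolation: uniform quantization of activations before they are transferred between workers, and a single parameter update applied from gradients accumulated over several micro-batches. This is a minimal NumPy sketch under our own assumptions, not the authors' implementation; the function names (quantize_activations, dequantize_activations, accumulated_gradient_update) and the 8-bit setting are hypothetical, and the paper's actual 2-Step GC algorithm is not reproduced here.

# Illustrative sketch only (assumptions noted above); not the Q-FB pipeline code.
import numpy as np

def quantize_activations(a, num_bits=8):
    """Uniformly quantize a float activation tensor to num_bits integer codes.

    Returns the codes plus the (scale, offset) needed to dequantize on the
    receiving worker, so only the compact codes travel over the network.
    """
    a_min, a_max = float(a.min()), float(a.max())
    levels = 2 ** num_bits - 1
    scale = (a_max - a_min) / levels if a_max > a_min else 1.0
    codes = np.round((a - a_min) / scale).astype(np.uint8)
    return codes, scale, a_min

def dequantize_activations(codes, scale, a_min):
    """Reconstruct approximate float activations from the integer codes."""
    return codes.astype(np.float32) * scale + a_min

def accumulated_gradient_update(params, grad_batches, lr=0.01):
    """Apply one parameter update from gradients accumulated over several
    micro-batches, instead of updating after every backward pass."""
    acc = np.zeros_like(params)
    for g in grad_batches:
        acc += g
    return params - lr * acc / len(grad_batches)

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Quantize a toy activation tensor and check the reconstruction error.
    acts = rng.normal(size=(4, 16)).astype(np.float32)
    codes, scale, a_min = quantize_activations(acts)
    recovered = dequantize_activations(codes, scale, a_min)
    print("max quantization error:", float(np.abs(acts - recovered).max()))

    # Accumulate gradients from four micro-batches before one update.
    w = rng.normal(size=16).astype(np.float32)
    grads = [rng.normal(size=16).astype(np.float32) for _ in range(4)]
    w_new = accumulated_gradient_update(w, grads)
    print("update norm:", float(np.linalg.norm(w - w_new)))

Quantizing to 8-bit codes cuts the activation traffic roughly fourfold relative to 32-bit floats, and deferring the update until gradients from several micro-batches are accumulated is what lets forward and backward work proceed concurrently without changing the effective update rule.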



Acknowledgements

This research is partially supported by the National Natural Science Foundation of China (U1811461) and the Guangdong Provincial Natural Science Foundation of China (2018B030312002).

Author information

Corresponding author

Correspondence to Weigang Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Xiao, D., Yang, C. & Wu, W. Efficient DNN training based on backpropagation parallelization. Computing 104, 2431–2451 (2022). https://doi.org/10.1007/s00607-022-01094-1

