Abstract
Pipeline parallelism is an efficient way to speed up the training of deep neural networks (DNNs) by partitioning the model and pipelining the training process across a cluster of workers in a distributed system. In this paper, we propose a new pipeline parallelization approach (Q-FB pipeline) for distributed deep learning that achieves both high training speed and high hardware utilization. The major novelty of Q-FB pipeline is a mechanism that parallelizes backpropagation training without loss of precision. Because the parameter updates in the backward phase depend on the errors computed in the forward phase, naively parallelizing the backpropagation process hurts the model's convergence behaviour. To provide convergence guarantees, Q-FB pipeline runs the forward and backward phases in parallel on different processors, using shared model memory and accumulated gradient updates. To overcome the communication bottleneck, Q-FB pipeline compresses both activations and gradients before transferring them to other workers: we adopt an activation quantization scheme to reduce traffic in the forward phase and propose a gradient compression algorithm (the 2-Step GC algorithm) to reduce communication costs in the backward phase. Experiments on both small and large computing clusters (e.g., the Tianhe-2 supercomputer) show that Q-FB pipeline effectively accelerates training without loss of convergence or precision.
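The sketch below is not the paper's Q-FB pipeline or 2-Step GC algorithm; it is a minimal illustration, under assumed implementations and hypothetical function names, of the two generic compression ideas the abstract mentions: uniform quantization of activations sent in the forward phase and magnitude-based sparsification of gradients sent in the backward phase.

```python
# Illustrative sketch only; the actual Q-FB pipeline and 2-Step GC algorithm differ.
import numpy as np

def quantize_activations(x, num_bits=8):
    """Uniformly quantize activations to `num_bits` integers plus scale/offset metadata."""
    levels = 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / levels if x_max > x_min else 1.0
    q = np.round((x - x_min) / scale).astype(np.uint8)  # compact payload to transfer
    return q, scale, x_min

def dequantize_activations(q, scale, x_min):
    """Recover an approximation of the activations on the receiving worker."""
    return q.astype(np.float32) * scale + x_min

def sparsify_gradient(grad, keep_ratio=0.01):
    """Keep only the largest-magnitude entries; send (indices, values) instead of the full tensor."""
    flat = grad.ravel()
    k = max(1, int(keep_ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx], grad.shape

def densify_gradient(idx, values, shape):
    """Rebuild a dense gradient from the sparse message for the parameter update."""
    dense = np.zeros(int(np.prod(shape)), dtype=np.float32)
    dense[idx] = values
    return dense.reshape(shape)

# Example: compress a fake activation tensor and a fake gradient tensor.
acts = np.random.randn(4, 256).astype(np.float32)
q, s, m = quantize_activations(acts)
print("mean activation error:", np.abs(dequantize_activations(q, s, m) - acts).mean())

grad = np.random.randn(256, 128).astype(np.float32)
idx, vals, shape = sparsify_gradient(grad, keep_ratio=0.05)
print("gradient payload entries:", idx.size, "of", grad.size)
```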
Acknowledgements
This research is partially supported by the National Natural Science Foundation of China (U1811461) and the Guangdong Provincial Natural Science Foundation of China (2018B030312002).
Cite this article
Xiao, D., Yang, C. & Wu, W. Efficient DNN training based on backpropagation parallelization. Computing 104, 2431–2451 (2022). https://doi.org/10.1007/s00607-022-01094-1