FLSGD: free local SGD with parallel synchronization

Published in The Journal of Supercomputing

A Correction to this article was published on 29 March 2022

This article has been updated

Abstract

Synchronous parameter-update algorithms with data parallelism have been successfully used to accelerate the distributed training of deep neural networks (DNNs). However, a prevalent shortcoming of synchronous methods is the computation wasted by mutual waiting among computational workers of different performance and by the communication delays at each synchronization. To alleviate this drawback, we propose a novel method, free local stochastic gradient descent (FLSGD) with parallel synchronization, to eliminate both the waiting and the communication overhead. Specifically, the distributed DNN training process is first modeled as a pipeline consisting of three components: dataset partition, local SGD, and parameter updating. Then, a novel adaptive batch-size and dataset-partition method based on the computational performance of each node is employed to eliminate waiting time by keeping the load of distributed DNN training balanced. The local SGD and the parameter updating (including gradient synchronization) are parallelized to eliminate communication cost through one-step gradient delaying, and the resulting staleness is remedied by an appropriate approximation. To the best of our knowledge, this is the first work that addresses both load balancing and communication overhead in distributed training. Extensive experiments are conducted with four state-of-the-art DNN models on two image classification datasets (CIFAR10 and CIFAR100) to demonstrate that FLSGD outperforms synchronous methods.
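
The abstract describes two mechanisms: (i) partitioning each global batch across workers in proportion to their computational performance, and (ii) overlapping local computation with gradient synchronization by applying one-step-delayed averaged gradients. The sketch below illustrates both ideas in PyTorch-style Python. It is a minimal sketch under our own assumptions, not the authors' implementation: the helpers `partition_batch`, `allreduce_async`, and `worker_speeds` are hypothetical names, and the paper's staleness-correcting approximation is omitted.

```python
import torch
import torch.nn.functional as F


def partition_batch(global_batch_size, worker_speeds):
    """Split a global batch across workers in proportion to measured
    throughput (samples/s), so faster nodes receive larger local batches."""
    total = sum(worker_speeds)
    sizes = [int(global_batch_size * s / total) for s in worker_speeds]
    sizes[0] += global_batch_size - sum(sizes)  # absorb rounding remainder
    return sizes


def train_epoch(model, optimizer, local_loader, allreduce_async, world_size):
    """Overlap communication with computation via a one-step gradient delay.

    `allreduce_async` stands in for a non-blocking collective (e.g. one built
    on torch.distributed.all_reduce(..., async_op=True)) that sums a list of
    gradient tensors across workers and returns a handle with .wait().
    """
    pending = None  # (handle, gradient buffers) from the previous step
    for inputs, targets in local_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(inputs), targets)
        loss.backward()

        # Launch synchronization of the *current* gradients; the collective
        # proceeds in the background while the next iteration computes.
        grads = [p.grad.detach().clone() for p in model.parameters()]
        handle = allreduce_async(grads)

        # Apply the averaged gradients from the *previous* step (stale by one).
        # The paper additionally remedies this staleness with an approximation,
        # which is not shown here.
        if pending is not None:
            prev_handle, prev_grads = pending
            prev_handle.wait()
            for p, g in zip(model.parameters(), prev_grads):
                p.grad.copy_(g / world_size)
            optimizer.step()

        pending = (handle, grads)
```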

Change history

Notes

  1. https://pytorch.org/.

Funding

This work was supported by the Key Program of the National Science Foundation of China (Grant No. 61836006) and the 111 Project (Grant No. B21044).

Author information

Corresponding author

Correspondence to Jiancheng Lv.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: The Funding information section was missing. Information on Fig. 6 was corrected.

About this article

Cite this article

Ye, Q., Zhou, Y., Shi, M. et al. FLSGD: free local SGD with parallel synchronization. J Supercomput 78, 12410–12433 (2022). https://doi.org/10.1007/s11227-021-04267-5
