
Accelerating local SGD for non-IID data using variance reduction

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Distributed stochastic gradient descent (SGD) and its variants have been widely adopted for training machine learning models with multiple workers in parallel. Among them, local-based algorithms, including Local SGD and FedAvg, have attracted much attention due to their low communication cost and privacy-preserving properties. Nevertheless, when the data distributions across workers are non-identical, local-based algorithms suffer a significant degradation in convergence rate. In this paper, we propose Variance Reduced Local SGD (VRL-SGD) to deal with heterogeneous data. Without extra communication cost, VRL-SGD reduces the gradient variance among workers caused by heterogeneous data, thereby avoiding the slow convergence of local-based algorithms. Moreover, we present VRL-SGD-W, which adds an effective warm-up mechanism for scenarios where the data across workers are highly diverse. Benefiting from eliminating the impact of such heterogeneous data, we theoretically prove that VRL-SGD achieves a linear iteration speedup with lower communication complexity even if workers access non-identical datasets. We conduct experiments on three machine learning tasks. The experimental results demonstrate that VRL-SGD performs significantly better than Local SGD on heterogeneous data and that VRL-SGD-W is much more robust under high data variance among workers.
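To make the high-level idea concrete, the sketch below shows one communication round of a variance-reduced local SGD scheme in the spirit of VRL-SGD. It is a minimal illustration under stated assumptions, not the authors' reference implementation: the names (`grad_fn`, `delta`, `local_steps`) and the exact form of the correction update are chosen for exposition. The point it demonstrates is that each worker offsets its local stochastic gradient with a per-worker correction term that is refreshed from the ordinary model average, so no communication beyond that of Local SGD is needed.

```python
# Minimal sketch of one round of a VRL-SGD-style variance-reduced local update.
# Assumptions: `workers` is a list of (data, delta) pairs, where `delta` is a
# correction array with the same shape as the model, initialised to zero, and
# `grad_fn(x, data)` returns a local stochastic gradient. These names are
# illustrative, not the paper's reference code.
import numpy as np

def vrl_sgd_round(x_global, workers, grad_fn, lr=0.1, local_steps=10):
    """Run `local_steps` corrected SGD steps on every worker, then average
    the models once, exactly as Local SGD / FedAvg would."""
    local_models = []
    for data, delta in workers:
        x = x_global.copy()
        for _ in range(local_steps):
            g = grad_fn(x, data)      # local stochastic gradient
            x -= lr * (g - delta)     # correction counteracts worker-specific drift
        local_models.append(x)

    # A single model average per round -- the same communication as Local SGD.
    x_avg = np.mean(local_models, axis=0)

    # Refresh each correction from how far the worker drifted from the average.
    # Starting from delta = 0, after one round delta_i approximates the gap
    # between worker i's average gradient and the global average gradient, so
    # subsequent local steps move roughly along the global descent direction.
    new_workers = [
        (data, delta + (x_avg - x_i) / (lr * local_steps))
        for (data, delta), x_i in zip(workers, local_models)
    ]
    return x_avg, new_workers
```

A warm-up variant in the spirit of VRL-SGD-W could, for instance, run the first few rounds with `local_steps = 1` (i.e., frequent synchronization) before switching to the corrected local updates above; the exact warm-up schedule here is again an assumption for illustration.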



Acknowledgements

This research was partially supported by grants from the National Key Research and Development Program of China (Grant No. 2018YFC0832101) and the National Natural Science Foundation of China (Grant Nos. U20A20229 and 61922073).

Author information


Corresponding author

Correspondence to Enhong Chen.

Additional information

Xianfeng Liang is currently an MS student in the School of Computer Science and Technology at the University of Science and Technology of China (USTC), China. His major research interests include machine learning and optimization.

Shuheng Shen received the MS degree from the University of Science and Technology of China (USTC), China in 2020. His major research interests include machine learning, stochastic optimization and distributed optimization.

Enhong Chen is a professor and vice dean of the School of Computer Science at the University of Science and Technology of China (USTC), China. He received the PhD degree from USTC, China. His general area of research includes data mining and machine learning, social network analysis and recommender systems. He has published more than 100 papers in refereed conferences and journals, including IEEE Trans. KDE, IEEE Trans. MC, KDD, ICDM, NIPS, and CIKM. He has served on the program committees of numerous conferences, including KDD, ICDM, and SDM. He is a senior member of the IEEE.

Jingchang Liu received the MS degree from the University of Science and Technology of China (USTC), China in 2019. His major research interests include machine learning, stochastic optimization and distributed optimization.

Qi Liu is a professor at the University of Science and Technology of China (USTC), China. He received the PhD degree in computer science from USTC, China. His general area of research is data mining and knowledge discovery. He has published prolifically in refereed journals and conference proceedings, e.g., TKDE, TOIS, TKDD, TIST, KDD, IJCAI, AAAI, ICDM, SDM and CIKM. He has served regularly on the program committees of a number of conferences and is a reviewer for leading academic journals in his field. He is a member of ACM and IEEE. Dr. Liu is the recipient of the KDD 2018 Best Student Paper Award (Research) and the ICDM 2011 Best Research Paper Award. He is supported by the Young Elite Scientist Sponsorship Program of CAST and the Youth Innovation Promotion Association of CAS.

Yifei Cheng is currently working toward the PhD degree in the School of Data Science, University of Science and Technology of China, China. His current research interests include machine learning, distributed optimization and federated learning.

Zhen Pan received the PhD degree from the University of Science and Technology of China, China in 2020. His major research interests include machine learning and data mining.



About this article


Cite this article

Liang, X., Shen, S., Chen, E. et al. Accelerating local SGD for non-IID data using variance reduction. Front. Comput. Sci. 17, 172311 (2023). https://doi.org/10.1007/s11704-021-1018-0

