
On the convergence and improvement of stochastic normalized gradient descent

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Non-convex models, such as deep neural networks, are widely used in machine learning applications. Training non-convex models is difficult owing to saddle points. Recently, stochastic normalized gradient descent (SNGD), which updates the model parameters with a normalized gradient in each iteration, has attracted much attention. Existing results show that SNGD escapes saddle points more effectively than classical training methods such as stochastic gradient descent (SGD). However, no existing study has provided a theoretical proof of the convergence of SNGD for non-convex problems. In this paper, we first prove the convergence of SNGD for non-convex problems. In particular, we prove that SNGD achieves the same computation complexity as SGD. Furthermore, our convergence proof shows that SNGD must adopt a small constant learning rate to guarantee convergence, which makes SNGD perform poorly when training large non-convex models in practice. Hence, we propose a new method, called stagewise SNGD (S-SNGD), to improve the performance of SNGD. Unlike SNGD, for which a small constant learning rate is necessary to guarantee convergence, S-SNGD can adopt a large initial learning rate and reduce the learning rate stage by stage. The convergence of S-SNGD can also be theoretically proved for non-convex problems. Empirical results on deep neural networks show that S-SNGD outperforms SNGD in terms of both training loss and test accuracy.
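To make the two update rules concrete, the following is a minimal sketch in Python (NumPy) based only on the description in the abstract: SNGD steps along the stochastic gradient divided by its norm, and S-SNGD runs SNGD in stages, starting from a large learning rate and shrinking it between stages. The quadratic objective, stage lengths, and geometric decay factor below are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def sngd_step(w, grad, lr, eps=1e-12):
    """One SNGD update: step along the normalized stochastic gradient."""
    return w - lr * grad / (np.linalg.norm(grad) + eps)

def stagewise_sngd(w0, stoch_grad, lr0=1.0, decay=0.5,
                   num_stages=4, iters_per_stage=1000):
    """Stagewise SNGD sketch: keep the learning rate constant within a
    stage and multiply it by `decay` (an assumed schedule) between stages."""
    w, lr = w0.copy(), lr0
    for _ in range(num_stages):
        for _ in range(iters_per_stage):
            w = sngd_step(w, stoch_grad(w), lr)
        lr *= decay
    return w

# Toy usage on noisy gradients of a simple quadratic (illustrative only).
rng = np.random.default_rng(0)
noisy_grad = lambda w: 2.0 * w + 0.1 * rng.standard_normal(w.shape)
w_final = stagewise_sngd(np.full(10, 5.0), noisy_grad)
print(np.linalg.norm(w_final))  # should end up close to the minimizer at 0
```

Plain SNGD corresponds to the inner loop alone with a fixed small learning rate; the stagewise schedule is what allows a large initial step size in S-SNGD.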



Acknowledgements

This work was supported by Science and Technology Project of State Grid Corporation of China (Grant No. SGGR0000XTJS1900448).

Author information


Corresponding author

Correspondence to Wu-Jun Li.

About this article

Cite this article

Zhao, SY., Xie, YP. & Li, WJ. On the convergence and improvement of stochastic normalized gradient descent. Sci. China Inf. Sci. 64, 132103 (2021). https://doi.org/10.1007/s11432-020-3023-7

