An Adaptive Learning Rate Schedule for SIGNSGD Optimizer in Neural Networks

Published in Neural Processing Letters

Abstract

SIGNSGD can dramatically improve the efficiency of training large neural networks by transmitting only the sign of each minibatch stochastic gradient, which compresses gradient communication while retaining a convergence rate comparable to standard stochastic gradient descent (SGD). Meanwhile, the learning rate plays a vital role in training neural networks, but existing learning rate strategies face the following problems: (1) learning rate decay methods produce small learning rates that slow convergence and require extra hyper-parameters beyond the initial learning rate, increasing manual tuning; (2) adaptive gradient algorithms generalize poorly and also introduce additional hyper-parameters; (3) generating learning rates via bi-level optimization models is difficult and time-consuming during training. To this end, we propose, for the first time, a novel adaptive learning rate schedule for neural network training with the SIGNSGD optimizer. Our method builds on the theoretical observation that the upper bound on the convergence rate is minimized with respect to the current learning rate at each iteration, so the current learning rate can be written as a closed-form expression that depends only on the historical learning rates. Given a single initial value, the learning rates at different training stages can then be obtained adaptively. The proposed method has the following advantages: (1) it is fully automatic and requires no hyper-parameters beyond one initial value, reducing manual intervention; (2) it converges faster than, and outperforms, standard SGD; (3) it enables neural networks to achieve better performance with fewer gradient communication bits. Numerical simulations are conducted on different neural networks with three public datasets, MNIST, CIFAR-10 and CIFAR-100, and the results demonstrate the efficiency of the proposed approach.
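To make the sign-compression step and the idea of a history-only learning rate schedule concrete, the following is a minimal Python sketch. The next_lr rule below (initial rate scaled by the inverse square root of the number of past steps) and the toy quadratic objective are illustrative assumptions standing in for the paper's closed-form recursion, which is derived from the convergence bound and is not reproduced in this preview.

# Minimal sketch of SIGNSGD with an illustrative history-based step size.
# Assumption: next_lr is a hypothetical placeholder, not the paper's rule.
import numpy as np

def signsgd_step(params, grad, lr):
    # One SIGNSGD update: move each parameter by lr along the sign
    # of its stochastic gradient (a 1-bit compressed direction).
    return params - lr * np.sign(grad)

def next_lr(lr_history):
    # Hypothetical schedule that depends only on past learning rates,
    # given a single initial value.
    return lr_history[0] / np.sqrt(len(lr_history))

# Toy problem: f(w) = 0.5 * ||w||^2, stochastic gradient = w + noise.
rng = np.random.default_rng(0)
w = rng.normal(size=10)
lr_history = [0.1]                          # sole hyper-parameter: initial lr
for _ in range(200):
    grad = w + 0.01 * rng.normal(size=10)   # noisy minibatch gradient
    w = signsgd_step(w, grad, lr_history[-1])
    lr_history.append(next_lr(lr_history))
print("final loss:", 0.5 * float(w @ w))

In a distributed setting, only np.sign(grad) would be communicated per worker, which is the source of the communication savings the abstract refers to.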



Acknowledgements

We would like to express our sincere gratitude to the editor and reviewers for their valuable comments on our work. We are grateful for the support from the National Key Research and Development Program of China under Grant No. 2018YFB0204301.

Author information

Corresponding author

Correspondence to Kang Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, K., Sun, T. & Dou, Y. An Adaptive Learning Rate Schedule for SIGNSGD Optimizer in Neural Networks. Neural Process Lett 54, 803–816 (2022). https://doi.org/10.1007/s11063-021-10658-9

