Abstract
The development of universal, high-efficiency optimization algorithms is an important research direction for neural networks. Stochastic Gradient Descent with Momentum (SGDM) is one of the most successful optimization algorithms, yet it easily becomes trapped in local minima. Inspired by the prominent success of fractional-order calculus in automatic control, we propose a fractional-order method named Fractional-Order Momentum (FracM). As a natural extension of integer-order calculus, fractional-order calculus inherits almost all of its characteristics while additionally providing memory and nonlocality. FracM applies a fractional-order difference to the momentum and gradient terms of the SGDM algorithm, which partially alleviates the problem of becoming trapped at local minima and accelerates training. The proposed FracM optimizer is competitive with SGDM, Adam, and other state-of-the-art optimization algorithms in terms of classification accuracy. Experiments show that FracM outperforms these optimizers on CIFAR-10/100 and on the IMDB text dataset with transformer-based models.
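The abstract describes FracM only at a high level (a fractional-order difference applied to the momentum and gradient terms of SGDM) and gives no update equations, so the Python sketch below is purely illustrative and should not be read as the authors' algorithm. It shows one standard way a truncated Grünwald–Letnikov (GL) fractional difference can replace the usual first-order momentum recursion; the class name, the hyperparameters (alpha, K), and the specific update rule are assumptions made for the example.

```python
from collections import deque

def gl_coefficients(alpha, K):
    """Truncated Grünwald–Letnikov coefficients c_k = (-1)^k * binom(alpha, k),
    k = 0..K, via the recurrence c_k = c_{k-1} * (k - 1 - alpha) / k."""
    c = [1.0]
    for k in range(1, K + 1):
        c.append(c[-1] * (k - 1 - alpha) / k)
    return c

class FractionalMomentumSketch:
    """Hypothetical fractional-order momentum (not the paper's exact FracM rule).

    The usual recursion v_t = mu * v_{t-1} + g_t is replaced by a truncated
    GL fractional difference, Delta^alpha v_t = g_t, which unrolls to
    v_t = g_t - sum_{k=1..K} c_k * v_{t-k}: a long-memory weighting of the
    last K velocities instead of a single geometric decay."""

    def __init__(self, lr=0.1, alpha=0.9, K=4):
        self.lr = lr
        self.c = gl_coefficients(alpha, K)     # c[0] = 1 is not used in step()
        self.vel = deque([0.0] * K, maxlen=K)  # past velocities, newest first

    def step(self, w, grad):
        # Fractional-order momentum: GL-weighted sum over the last K velocities.
        v = grad - sum(ck * vk for ck, vk in zip(self.c[1:], self.vel))
        self.vel.appendleft(v)
        return w - self.lr * v

# Toy usage on f(w) = 0.5 * w^2 (gradient = w). With alpha = 1 the recursion
# reduces to plain gradient accumulation; alpha in (0, 1) spreads the momentum
# weight over the last K steps, reflecting the memory property of
# fractional-order calculus mentioned in the abstract.
opt = FractionalMomentumSketch(lr=0.1, alpha=0.9, K=4)
w = 5.0
for _ in range(100):
    w = opt.step(w, w)
```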
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
About this article
Cite this article
Yu, Z., Sun, G. & Lv, J. A fractional-order momentum optimization approach of deep neural networks. Neural Comput & Applic 34, 7091–7111 (2022). https://doi.org/10.1007/s00521-021-06765-2