Abstract
Among adaptive algorithms, Adam is the most widely used, especially for training deep neural networks. However, recent studies have shown that it generalizes poorly and may even fail to converge in extreme cases. AdaX (2020) is a variant of Adam that modifies Adam's second moment, giving the algorithm generalization ability comparable to that of SGD. This work aims to improve the AdaX algorithm with faster convergence and higher training accuracy. The first moment of AdaX is essentially a classical momentum term, whereas Nesterov's accelerated gradient (NAG) is theoretically and empirically superior to classical momentum. We therefore replace the classical momentum term in the first moment of AdaX with NAG, obtaining an algorithm named Nesterov's accelerated AdaX (Nadax). Extensive experiments on deep learning tasks show that training models with the proposed Nadax brings favorable benefits.
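For concreteness, the sketch below shows one way a Nadax-style parameter update could be written in Python/NumPy. The abstract does not give the update equations, so the AdaX-style accumulated second moment with its (1+beta2)^t - 1 bias correction and the Nadam-style Nesterov look-ahead on the first moment are assumptions made for illustration only; the function name nadax_update and its default hyperparameters are hypothetical and not taken from the paper.

import numpy as np

def nadax_update(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=1e-4, eps=1e-8):
    # One illustrative Nadax-style update step (t counts updates starting at 1).
    # First moment: exponential moving average of gradients (classical momentum form).
    m = beta1 * m + (1.0 - beta1) * grad
    # Second moment, AdaX-style: accumulated with weight (1 + beta2) rather than averaged,
    # giving an exponential long-term memory of past squared gradients.
    v = (1.0 + beta2) * v + beta2 * grad ** 2
    v_hat = v / ((1.0 + beta2) ** t - 1.0)  # AdaX bias correction
    # Nesterov look-ahead: mix the current gradient back into the momentum estimate
    # (Nadam-style) instead of applying m alone.
    m_lookahead = beta1 * m + (1.0 - beta1) * grad
    theta = theta - lr * m_lookahead / (np.sqrt(v_hat) + eps)
    return theta, m, v

Used inside a training loop, theta, grad, m, and v would be arrays of the same shape as the parameters, with m and v initialized to zero and t incremented at each step.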







References
Li W, Zhang Z, Wang X, Luo P (2020) AdaX: adaptive gradient descent with exponential long term memory. arXiv:2004.09740
Sharma N, Jain V, Mishra A (2018) An analysis of convolutional neural networks for image classification. Procedia Comput Sci 132:377–384
Zhao W, Lou M, Qi Y, Wang Y, Xu C, Deng X, Ma Y (2021) Adaptive channel and multiscale spatial context network for breast mass segmentation in full-field mammograms. Applied Intelligence 51(12):8810–8827
Tian P, Mo H, Jiang L (2021) Scene graph generation by multi-level semantic tasks. Applied Intelligence 51(11):7781–7793
Gupta AK, Gupta P, Rahtu E (2021) FATALRead - fooling visual speech recognition models
Robbins H, Monro S (1951) A stochastic approximation method. The Annals of Mathematical Statistics 22(3):400–407
Nesterov Y (1983) A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady AN USSR 269:543–547
Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. In: International conference on machine learning, pages 1139–1147. PMLR
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(7):2121–2159
Zeiler MD (2012) ADADELTA: an adaptive learning rate method. arXiv:1212.5701
Tieleman T, Hinton G (2012) Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4(2):26–31
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv:1412.6980
Dozat T (2016) Incorporating Nesterov momentum into Adam
Reddi SJ, Kale S, Kumar S (2019) On the convergence of Adam and beyond. arXiv:1904.09237
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning
Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. arXiv:1902.09843
Polyak BT (1964) Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5):1–17
Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. Adv Neural Inf Process Syst 33:18795–18806
Hazan E (2019) Introduction to online convex optimization. arXiv:1909.05207
Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th international conference on machine learning (ICML-03), pages 928–936
LeCun Y (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
Loshchilov I, Hutter F (2017) Decoupled weight decay regularization. arXiv:1711.05101
Xiao H, Rasul K, Vollgraf R (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition
Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images
Everingham M, Eslami SM, Van Gool L, Williams CKI, Winn J, Zisserman A (2015) The PASCAL visual object classes challenge: a retrospective. International Journal of Computer Vision 111(1):98–136
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Li H, Xu Z, Taylor G, Studer C, Goldstein T (2018) Visualizing the loss landscape of neural nets
Acknowledgements
This work is supported in part by the Natural Science Foundation of China under Grant No. 61472003, by the Academic and Technical Leaders and Backup Candidates of Anhui Province under Grant No. 2019h211, and by the Innovation Team of '50 Star of Science and Technology' of Huainan, Anhui Province.
Cite this article
Gui, Y., Li, D. & Fang, R. A fast adaptive algorithm for training deep neural networks. Appl Intell 53, 4099–4108 (2023). https://doi.org/10.1007/s10489-022-03629-7