Abstract
Stochastic adaptive gradient descent algorithms, such as AdaGrad and Adam, are extensively used to train deep neural networks. However, randomly sampled gradient information introduces instability into the learning rates, which can cause adaptive methods to generalize poorly. To address this issue, we propose the ABNGrad algorithm, which leverages an absolute value operation and a normalization technique. More specifically, the absolute value function is first incorporated into the iteration of the second-order moment estimate to ensure that it increases monotonically. The normalization technique is then employed to prevent a rapid decrease in the learning rate. In particular, the techniques used in this paper can also be integrated into other existing adaptive algorithms, such as Adam, AdamW, AdaBound, and RAdam, yielding good performance. Additionally, it is shown that ABNGrad attains the optimal regret bound for online convex optimization problems. Finally, extensive experimental results illustrate the effectiveness of ABNGrad. For a comprehensive exploration of the advantages of the proposed approach and the details of its implementation, readers are referred to https://github.com/Wenhan-Jiang/ABNGrad.git
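As a rough illustration of the two ingredients described above — the absolute-value iterate for the second-order moment estimate and the normalization that keeps the learning rate from collapsing — the following NumPy sketch implements one plausible ABNGrad-style update. The function name, the default hyperparameters, and the unconstrained step are our assumptions; the paper's Algorithm 2 additionally projects the iterate onto the feasible set.

```python
import numpy as np

def abngrad_step(x, m, v, v_hat, grad, t, alpha=1e-3, beta1=0.9,
                 beta2=0.999, eps=1e-8, p=2):
    """One ABNGrad-style update (a sketch, not the paper's exact code).

    The absolute value keeps the raw second-moment surrogate moving toward
    |g_t^2 - v_{t-1}| monotonically, and the p-norm normalization enforces
    ||v_t||_p = 1 so the effective step size cannot shrink too quickly.
    """
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    raw = v + (1 - beta2) * np.abs(grad**2 - v)   # absolute-value iterate
    v = raw / np.linalg.norm(raw, ord=p)          # normalize: ||v_t||_p = 1
    v_hat = np.maximum(v_hat, v)                  # monotone max, AMSGrad-style
    step = alpha / np.sqrt(t)                     # alpha_t = alpha / sqrt(t)
    x = x - step * m / np.sqrt(v_hat + eps)       # diagonally preconditioned step
    return x, m, v, v_hat
```

Because the second-moment vector is renormalized at every iteration, \(\Vert v_{t}\Vert _{p}=1\) holds by construction, which is the property used in the regret analysis of Appendix A.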
Data Availability
The data that support the findings of this study are openly available in CIFAR-10 at https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, in CIFAR-100 at https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz, in CINIC-10 at https://paperswithcode.com/sota/image-classification-on-cinic-10, and in VOC at http://host.robots.ox.ac.uk/pascal/VOC/. The authors confirm that the PTB dataset is available in the article The Penn Treebank: An Overview.
References
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, (NAACL), Minneapolis, Minnesota, June, vol 1, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423
Dai Z, Yang Z, Yang Y, Carbonell J, Le Q, Salakhutdinov R (2019) Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th annual meeting of the association for computational linguistics (ACL), pp 2978–2988. https://doi.org/10.18653/v1/P19-1285
Zhang T, Chen S, Wulamu A, Guo X, Li Q, Zheng H (2023) Transg-net: transformer and graph neural network based multi-modal data fusion network for molecular properties prediction. Appl Intell 53:16077–16088. https://doi.org/10.1007/s10489-022-04351-0
Kononov E, Tashkinov M, Silberschmidt VV (2023) Reconstruction of 3d random media from 2d images: generative adversarial learning approach. Comput Aided Des 158:103498. https://doi.org/10.1016/j.cad.2023.103498
Mathis A, Mamidanna P, Cury KM et al (2018) Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nat Neurosci 21(9):1281–1289. https://doi.org/10.1038/s41593-018-0209-y
Huang B, Zhang S, Huang J, Yu Y, Shi Z, Xiong Y (2022) Knowledge distilled pre-training model for vision-language-navigation. Appl Intell 53:5607–5619. https://doi.org/10.1007/s10489-022-03779-8
Kumar A, Aggarwal RK (2022) An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for hindi speech recognition. J Reliab Intell Environ 8:117–132. https://doi.org/10.1007/s40860-021-00140-7
Hu L, Fu C, Ren Z et al (2023) Sselm-neg: spherical search-based extreme learning machine for drug-target interaction prediction. BMC Bioinformatics 24:38. https://doi.org/10.1186/s12859-023-05153-y
Xu Y, Verma D, Sheridan RP et al (2020) Deep dive into machine learning models for protein engineering. J Chem Inf Model 60(6):2773–2790. https://doi.org/10.1021/acs.jcim.0c00073
Waring J, Lindvall C, Umeton R (2020) Automated machine learning: review of the state-of-the-art and opportunities for healthcare. Artif Intell Med 104:101822. https://doi.org/10.1016/j.artmed.2020.101822
Wu J, Chen X-Y, Zhang H, Xiong L-D, Lei H, Deng S-H (2019) Hyperparameter optimization for machine learning models based on bayesian optimization. J Electron Sci Technol 17(1):26–40. https://doi.org/10.11989/JEST.1674-862X.80904120
Abbaszadeh Shahri A, Pashamohammadi F, Asheghi R, Abbaszadeh Shahri H (2022) Automated intelligent hybrid computing schemes to predict blasting induced ground vibration. Engineering with Computers 38(4):3335–3349. https://doi.org/10.1007/s00366-021-01444-1
Yuan W, Hu F, Lu L (2022) A new non-adaptive optimization method: stochastic gradient descent with momentum and difference. Appl Intell 52:3939–3953. https://doi.org/10.1007/s10489-021-02224-6
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159. https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Yedida R, Aha S, Prashanth T (2021) Lipschitzlr: using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51:1460–1478. https://doi.org/10.1007/s10489-020-01892-0
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International conference on learning representations, ICLR, San Diego, CA, USA, May, San Diego, CA, USA. http://arxiv.org/abs/1412.6980
Reddi SJ, Kale S, Kumar S (2018) On the convergence of adam and beyond. In: 6th International conference on learning representations, ICLR, Vancouver, BC, Canada, April, Vancouver, BC, Canada. https://openreview.net/forum?id=ryQu7f-RZ
Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. In: 7th International conference on learning representations, ICLR, New Orleans, LA, USA, May 6-9, New Orleans, LA, USA. https://openreview.net/forum?id=Bkg3g2R9FX
Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. In: 7th International conference on learning representations, ICLR, New Orleans, LA, USA, May 6-9. https://openreview.net/forum?id=Bkg6RiCqY7
Liu L, Jiang H, He P, Chen W, Liu X, Gao J, Han J (2020) On the variance of the adaptive learning rate and beyond. In: International conference on learning representations, Ethiopia, July. https://openreview.net/forum?id=rkgz2aEKDr
Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th international conference on machine learning, ICML, Washington, DC, USA, August 21-24, pp 928–936. https://icml.cc/Conferences/2010/papers/473.pdf
Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Mach Learn 69:169–192. https://doi.org/10.1007/s10994-007-5016-8
Zeng K, Liu J, Jiang Z, Xu D (2022) A decreasing scaling transition scheme from adam to sgd. Adv Theory Simul 5(7). https://doi.org/10.1002/adts.202100599
Jalaian B, Lee M, Russell S (2019) Uncertain context: uncertainty quantification in machine learning. AI Mag 40(4):40–49. https://doi.org/10.1609/aimag.v40i4.4812
Wu X, Wagner P, Huber MF (2023) Quantification of uncertainties in neural networks. In: Shajek A, Hartmann EA (eds) New digital work. Springer, Cham, pp 276–287. https://doi.org/10.1007/978-3-031-26490-0_16
Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) Adabelief optimizer: adapting stepsizes by the belief in observed gradients. In: Advances in neural information processing systems, vol 33, pp 18795–18806. https://proceedings.neurips.cc/paper_files/paper/2020/file/d9d4f495e875a2e075a1a4a6e1b9770f-Paper.pdf
Koçak H (2021) A combined meshfree exponential Rosenbrock integrator for the third-order dispersive partial differential equations. Numer Methods Partial Differ Equ 37(3):2458–2468. https://doi.org/10.1002/num.22726
Oza U, Patel S, Kumar P (2021) Noveme - color space net for image classification. Intell Inf Database Syst 12672:531–543. https://doi.org/10.1007/978-3-030-73280-6_42
Branco A, Carvalheiro C, Costa F, Castro S, Silva J, Martins C, Ramos J (2014) Deepbankpt and companion portuguese treebanks in a multilingual collection of treebanks aligned with the penn treebank. In: Computational processing of the Portuguese language, pp 207–213. https://doi.org/10.1007/978-3-319-09761-9_23
Ma X, Tao Z, Wang Y, Yu H, Wang Y (2015) Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54:187–197. https://doi.org/10.1016/j.trc.2015.03.014
McMahan HB, Streeter MJ (2010) Adaptive bound optimization for online convex optimization. In: Proceedings of the 23rd conference on learning theory (COLT), pp 224–256. https://www.learningtheory.org/colt2010/conference-website/papers/104mcmahan.pdf
Acknowledgements
This work was funded in part by the Natural Science Foundation of Jilin Province (No.YDZJ202201ZYTS519, No.YDZJ202201ZYTS585), and in part by the National Natural Science Foundation of China (No.62176051).
Author information
Authors and Affiliations
Contributions
Wenhan Jiang: writing - original draft preparation, conceptualization, methodology, investigation. Yuqing Liang: writing - original draft preparation, writing - review and editing. Zhixia Jiang: writing - review and editing, supervision. Dongpo Xu: writing - review and editing, conceptualization. Linhua Zhou: conceptualization, validation.
Corresponding authors
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Proof of theorem 1
Lemma 1
(Lemma 3 [31]) For any \(Q \in S_{d}^{+}\) and convex feasible set \(\mathcal {F} \subset \mathbb {R}^{d}\), suppose \({{u}_{1}}=\arg \min _{x\in \mathcal {F}}\,\left\| {{Q}^{1/2}}\left( x-{{z}_{1}} \right) \right\| \) and \({{u}_{2}}=\arg \min _{x\in \mathcal {F}}\,\left\| {{Q}^{1/2}}\left( x-{{z}_{2}} \right) \right\| \) where \({{z}_{1}},{{z}_{2}}\in {{\mathbb {R}}^{d}}\), then we have \({{\left\| {{Q}^{1/2}}\left( {{u}_{1}}-{{u}_{2}} \right) \right\| }^{2}}\le {{\left\| {{Q}^{1/2}}\left( {{z}_{1}}-{{z}_{2}} \right) \right\| }^{2}}\).
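As a quick numerical illustration of Lemma 1 (not part of the original proof), consider a diagonal \(Q\) and the box \(\mathcal {F}=[0,1]^{d}\), for which the \(Q\)-weighted projection decouples into coordinate-wise clipping; the clipped points are then nonexpansive in the weighted norm:

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.uniform(0.5, 2.0, size=4)        # diagonal of Q, positive definite
z1 = rng.normal(size=4) * 3
z2 = rng.normal(size=4) * 3

# For diagonal Q and F = [0, 1]^4, minimizing ||Q^{1/2}(x - z)|| over F
# separates per coordinate into q_i (x_i - z_i)^2, solved by clipping.
u1, u2 = np.clip(z1, 0, 1), np.clip(z2, 0, 1)

def wnorm(x):
    """Weighted norm ||Q^{1/2} x|| for diagonal Q with diagonal q."""
    return np.sqrt(np.sum(q * x**2))

# Lemma 1: the projection is nonexpansive in the Q-weighted norm.
assert wnorm(u1 - u2) <= wnorm(z1 - z2) + 1e-12
```

The inequality holds here because clipping is 1-Lipschitz in each coordinate; the lemma in [31] establishes the same property for general \(Q \in S_{d}^{+}\) and convex \(\mathcal {F}\).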
Proof of Theorem 1
By the definition of the projection operation \(\prod _{\mathcal {F},A}\) in Section 2, we have
Since \({{x}^{*}}\in \mathcal {F}\), we then know that \(\prod \nolimits _{\mathcal {F},\sqrt{{{V}_{t}}}}{\left( {{x}^{*}} \right) }={{x}^{*}}\). Thus, using Lemma 1 with \({{u}_{1}}={{x}_{t+1}}\), \({{u}_{2}}={{x}^{*}}\) and \(Q=\sqrt{{{V}_{t}}}\), we arrive at
where the last equality follows from the iterate of \(m_{t}\) in Algorithm 2, i.e., \(m_{t}=\beta _{1t}m_{t-1}+(1-\beta _{1t})g_{t}\). Rearranging (A2), we have
Since \(0\le \beta _{1t}<1\), dividing both sides of (A3) by \(2\alpha _{t}(1-\beta _{1t})\) gives
where we introduce \(\Theta _{t}\) to keep the proof clear. Applying the inequality \(\langle a,b\rangle \le \frac{\epsilon }{2}\Vert a\Vert ^{2}+\frac{1}{2\epsilon }\Vert b\Vert ^{2}\) with \(a=-V_{t}^{-1/4}m_{t-1}\), \(b=V_{t}^{1/4}(x_{t}-x^{*})\) and \(\epsilon =\alpha _{t}\), the inner product in (A4) becomes
Now, substituting (A5) into (A4), we have
Summing over t from 1 to T, we obtain
where \(\Theta _{t}\), \(\Lambda _{t}\) and \(\Upsilon _{t}\) are defined in (A4) and (A6), respectively.
Next, we estimate the terms \(\Theta _{t}\), \(\Lambda _{t}\) and \(\Upsilon _{t}\) in (A7) as follows. For simplicity, we set \(\widetilde{v}_{t}=\hat{v}_{t}+\epsilon \), so that \(V_{t}=\text {diag}(\hat{v}_{t}+\epsilon )=\text {diag}(\widetilde{v}_{t})\) in Algorithm 2; thus
where the first inequality holds by \(\beta _{11}=\beta _{1}\) and the condition that \(\beta _{1t}\le \beta _{1(t-1)}\), i.e., \(-\frac{1}{1-\beta _{1(t-1)}}\le -\frac{1}{1-\beta _{1t}}\), and the last inequality follows from Assumption 1, which gives \((x_{1,i}-x^{*}_{i})^{2}\le \Vert x_{1}-x^{*}\Vert _{\infty }^{2}\le D_{\infty }^{2}\). It then follows from the update of \(\hat{v}_{t}\) in Algorithm 2, i.e., \(\hat{v}_{t}=\max \{\hat{v}_{t-1},v_{t}\}\), that \(\hat{v}_{t}\ge \hat{v}_{t-1}\). Then, since \(\widetilde{v}_{t}=\hat{v}_{t}+\epsilon \), we also have \(\widetilde{v}_{t}\ge \widetilde{v}_{t-1}\). Thus, from the stepsize condition \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\), we have
Thus, using the condition that \(\beta _{1t}\le \beta _{1}\), we can further estimate (A8) as follows.
where the second inequality is due to \((x_{t,i}-x^{*}_{i})^{2}\le \Vert x_{t}-x^{*}\Vert _{\infty }^{2}\le D_{\infty }^{2}\) assumed in Assumption 1, and the last equality holds by the stepsize condition that \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\).
By the definition of \(\Lambda _{t}\) in (A6), we have
where the first inequality holds by the conditions that \(\beta _{1t}<1\) and \(\beta _{1t}\le \beta _{1}\), and the second inequality follows from \(\widetilde{v}_{t,i}=\hat{v}_{t,i}+\epsilon \ge \epsilon \), i.e., \(\widetilde{v}_{t,i}^{-1/2}\le \frac{1}{\sqrt{\epsilon }}\) for all \(t\in \left[ T \right] \). The third inequality is due to \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\), so \(\alpha _{t}\le \alpha _{t-1}\), and the last inequality holds because \(m_{0}=0\), i.e., \(\sum _{t=1}^{T}\alpha _{t-1}m_{t-1,i}^{2}=\sum _{t=1}^{T-1}\alpha _{t}m_{t,i}^{2}\le \sum _{t=1}^{T}\alpha _{t}m_{t,i}^{2}\). Next, from the iterate of \(m_{t}\) in Algorithm 2, we arrive at
where the first inequality follows from \(1-\beta _{1k}\le 1\) and \(\beta _{1j}\le \beta _{1}\) for any \(j\ge 1\), the second inequality holds by the Jensen inequality with respect to the convex function \(x^{2}\), and the last inequality is due to the fact that \(\beta _{1}<1\), thus \(\sum _{k=0}^{\infty }\beta _{1}^{k}\le \frac{1}{1-\beta _{1}}\). Now, plugging (A12) in (A11) gives
where we exchange the order of summation in the first equality, and the second inequality follows from the fact that \(\sqrt{k}\le \sqrt{t}\) for any \(t\ge k\). The last two inequalities hold by \(\beta _{1}\ge 0\), so \(\sum _{t=k}^{T}\beta _{1}^{t}\le \sum _{t=k}^{\infty }\beta _{1}^{t}\le \frac{\beta _{1}^{k}}{1-\beta _{1}}\). Further, applying the bounded gradient condition in Assumption 2, i.e., \(\Vert g_{t}\Vert \le G\), we obtain
where the last inequality follows from
Combining (A14) and (A13), we have
From the definition of \(\Upsilon _{t}\) in (A6), we have
where the first inequality holds by the condition that \(\beta _{1t}\le \beta _{1}\) and the last inequality is due to Assumption 1.
Finally, substituting the bounds on \(\Theta _{t}\) in (A10), \(\Lambda _{t}\) in (A16) and \(\Upsilon _{t}\) in (A17) into (A7), we arrive at
Next, since \({{v}_{t}}=\frac{{{v}_{t-1}}+\left( 1-{{\beta }_{2}} \right) \left| g_{t}^{2}-{{v}_{t-1}} \right| }{{{\left\| {{v}_{t-1}}+\left( 1-{{\beta }_{2}} \right) \left| g_{t}^{2}-{{v}_{t-1}} \right| \right\| }_{p}}}\), we have \(\Vert v_{t}\Vert =1\). By the iterate of \(\hat{v}_{t}\), i.e., \(\hat{v}_{t}=\max \{\hat{v}_{t-1},v_{t}\}\) with \(\hat{v}_{0}=0\), we obtain \(\Vert \hat{v}_{t}\Vert =1\). By the definition of \(\widetilde{v}_{t}\), we have
Combining (A18) and (A19) gives
Applying the convexity of the function \(f_{t}\), we know that
This completes the proof. \(\square \)
Appendix B Proof of corollary 1
Proof of Corollary 1
By the conditions that \({{\beta }_{1t}}={{\beta }_{1}}{{\lambda }^{t}}\) and \(0<\lambda <1\), we have
From Theorem 1, it is clear that
This completes the proof. \(\square \)
Appendix C Proof of corollary 2
Proof of Corollary 2
Since \({{\beta }_{1t}}={{\beta }_{1}}/t\) and \(\alpha _{t}=\alpha /\sqrt{t}\), we then have
where the last inequality is due to
From Theorem 1, it is clear that
This completes the proof. \(\square \)
Appendix D Algorithms
This section provides the algorithms for the different optimization methods, including AdaGrad, AdaGradN, AdaBound, ABNBound, AdamW, ABNAdamW, RAdam and ABNRAdam.
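To make the pattern concrete, the absolute-value and normalization modifications can be sketched on an AdaGrad-style accumulator roughly as follows. This is a hypothetical AdaGradN-style step for illustration only; the function name and default hyperparameters are assumptions, not the paper's exact pseudocode.

```python
import numpy as np

def adagradn_step(x, v, grad, alpha=1e-2, beta2=0.999, eps=1e-8, p=2):
    """A sketch of the ABN modifications on an AdaGrad-style accumulator.

    Instead of accumulating squared gradients without bound, the accumulator
    moves toward |g^2 - v| and is renormalized to unit p-norm each step.
    """
    raw = v + (1 - beta2) * np.abs(grad**2 - v)   # absolute-value iterate
    v = raw / np.linalg.norm(raw, ord=p)          # normalize the accumulator
    x = x - alpha * grad / (np.sqrt(v) + eps)     # AdaGrad-style update
    return x, v
```

The same substitution of the second-moment iterate applies to the momentum-based variants (ABNBound, ABNAdamW, ABNRAdam), which keep their respective step-size schedules and bias corrections.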
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, W., Liang, Y., Jiang, Z. et al. ABNGrad: adaptive step size gradient descent for optimizing neural networks. Appl Intell 54, 2361–2378 (2024). https://doi.org/10.1007/s10489-024-05303-6
DOI: https://doi.org/10.1007/s10489-024-05303-6