
ABNGrad: adaptive step size gradient descent for optimizing neural networks

Abstract

Stochastic adaptive gradient descent algorithms, such as AdaGrad and Adam, are widely used to train deep neural networks. However, randomly sampled gradient information introduces instability into the learning rate, which can leave adaptive methods with poor generalization. To address this issue, the ABNGrad algorithm, which leverages the absolute value operation and the normalization technique, is proposed. More specifically, the absolute value function is first incorporated into the iteration of the second-order moment estimate to ensure that it increases monotonically. The normalization technique is then employed to prevent a rapid decrease in the learning rate. In particular, the techniques used in this paper can also be integrated into other existing adaptive algorithms, such as Adam, AdamW, AdaBound, and RAdam, yielding good performance. Additionally, it is shown that ABNGrad attains the optimal regret bound for solving online convex optimization problems. Finally, extensive experimental results illustrate the effectiveness of ABNGrad. For a comprehensive exploration of the advantages of the proposed approach and the details of its implementation, readers are referred to https://github.com/Wenhan-Jiang/ABNGrad.git
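
To make the two ingredients concrete, the following is a minimal NumPy sketch of a single ABNGrad step in the unconstrained setting, assembled from the update rules stated in Appendix A (momentum \(m_t\), absolute-value second-moment update, \(p\)-norm normalization, and the elementwise maximum \(\hat{v}_t\)). The function name, default hyperparameters, and the omission of the projection step are our simplifications; the repository above remains the reference implementation.

```python
import numpy as np

def abngrad_step(x, g, m, v, v_hat, t, alpha=1e-3,
                 beta1=0.9, beta2=0.999, eps=1e-8, p=2):
    """One unconstrained ABNGrad step (illustrative sketch only)."""
    # First-order moment: m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
    m = beta1 * m + (1 - beta1) * g
    # Absolute-value increment keeps the raw second moment
    # monotonically non-decreasing: u_t = v_{t-1} + (1 - beta2) * |g_t^2 - v_{t-1}|
    u = v + (1 - beta2) * np.abs(g**2 - v)
    # Normalization to unit p-norm prevents the effective learning
    # rate from collapsing as u_t grows
    v = u / np.linalg.norm(u.ravel(), ord=p)
    # Elementwise maximum, as in AMSGrad: v_hat_t = max(v_hat_{t-1}, v_t)
    v_hat = np.maximum(v_hat, v)
    # Decaying step size alpha_t = alpha / sqrt(t)
    alpha_t = alpha / np.sqrt(t)
    x = x - alpha_t * m / np.sqrt(v_hat + eps)
    return x, m, v, v_hat
```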

Data Availability

The data that support the findings of this study are openly available: CIFAR-10 at https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, CIFAR-100 at https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz, CINIC-10 at https://paperswithcode.com/sota/image-classification-on-cinic-10, and PASCAL VOC at http://host.robots.ox.ac.uk/pascal/VOC/. The authors confirm that the PTB dataset is available in the article The Penn Treebank: An Overview.

Notes

  1. https://github.com/lanpa/tensorboardX.git

  2. https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz

  3. https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz

  4. https://paperswithcode.com/sota/image-classification-on-cinic-10

  5. The authors confirm that the PTB dataset is available in the article The Penn Treebank: An Overview

  6. https://github.com/ultralytics/yolov5.git

  7. http://host.robots.ox.ac.uk/pascal/VOC/

References

  1. Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, (NAACL), Minneapolis, Minnesota, June, vol 1, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423

  2. Dai Z, Yang Z, Yang Y, Carbonell J, Le Q, Salakhutdinov R (2019) Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th annual meeting of the association for computational linguistics (ACL), pp 2978–2988. https://doi.org/10.18653/v1/P19-1285

  3. Zhang T, Chen S, Wulamu A, Guo X, Li Q, Zheng H (2023) TransG-net: transformer and graph neural network based multi-modal data fusion network for molecular properties prediction. Appl Intell 53:16077–16088. https://doi.org/10.1007/s10489-022-04351-0

  4. Kononov E, Tashkinov M, Silberschmidt VV (2023) Reconstruction of 3d random media from 2d images: generative adversarial learning approach. Comput Aided Des 158:103498. https://doi.org/10.1016/j.cad.2023.103498

  5. Mathis A, Mamidanna P, Cury KM et al (2018) Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nat Neurosci 21(9):1281–1289. https://doi.org/10.1038/s41593-018-0209-y

  6. Huang B, Zhang S, Huang J, Yu Y, Shi Z, Xiong Y (2022) Knowledge distilled pre-training model for vision-language-navigation. Appl Intell 53:5607–5619. https://doi.org/10.1007/s10489-022-03779-8

  7. Kumar A, Aggarwal RK (2022) An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for hindi speech recognition. J Reliab Intell Environ 8:117–132. https://doi.org/10.1007/s40860-021-00140-7

  8. Hu L, Fu C, Ren Z et al (2023) Sselm-neg: spherical search-based extreme learning machine for drug-target interaction prediction. BMC Bioinformatics 24:38. https://doi.org/10.1186/s12859-023-05153-y

  9. Xu Y, Verma D, Sheridan RP et al (2020) Deep dive into machine learning models for protein engineering. J Chem Inf Model 60(6):2773–2790. https://doi.org/10.1021/acs.jcim.0c00073

  10. Waring J, Lindvall C, Umeton R (2020) Automated machine learning: review of the state-of-the-art and opportunities for healthcare. Artif Intell Med 104:101822. https://doi.org/10.1016/j.artmed.2020.101822

  11. Wu J, Chen X-Y, Zhang H, Xiong L-D, Lei H, Deng S-H (2019) Hyperparameter optimization for machine learning models based on Bayesian optimization. J Electron Sci Technol 17(1):26–40. https://doi.org/10.11989/JEST.1674-862X.80904120

  12. Abbaszadeh Shahri A, Pashamohammadi F, Asheghi R, Abbaszadeh Shahri H (2022) Automated intelligent hybrid computing schemes to predict blasting induced ground vibration. Engineering with Computers 38(4):3335–3349. https://doi.org/10.1007/s00366-021-01444-1

  13. Yuan W, Hu F, Lu L (2022) A new non-adaptive optimization method: stochastic gradient descent with momentum and difference. Appl Intell 52:3939–3953. https://doi.org/10.1007/s10489-021-02224-6

  14. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159. https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

  15. Yedida R, Aha S, Prashanth T (2021) Lipschitzlr: using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51:1460–1478. https://doi.org/10.1007/s10489-020-01892-0

  16. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International conference on learning representations, ICLR, San Diego, CA, USA, May. http://arxiv.org/abs/1412.6980

  17. Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond. In: 6th International conference on learning representations, ICLR, Vancouver, BC, Canada, April. https://openreview.net/forum?id=ryQu7f-RZ

  18. Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. In: 7th International conference on learning representations, ICLR, New Orleans, LA, USA, May 6-9. https://openreview.net/forum?id=Bkg3g2R9FX

  19. Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. In: 7th International conference on learning representations, ICLR, New Orleans, LA, USA, May 6-9. https://openreview.net/forum?id=Bkg6RiCqY7

  20. Liu L, Jiang H, He P, Chen W, Liu X, Gao J, Han J (2020) On the variance of the adaptive learning rate and beyond. In: 8th International conference on learning representations, ICLR, Addis Ababa, Ethiopia. https://openreview.net/forum?id=rkgz2aEKDr

  21. Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th international conference on machine learning, ICML, Washington, DC, USA, August 21-24, pp 928–936. https://icml.cc/Conferences/2010/papers/473.pdf

  22. Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Mach Learn 69:169–192. https://doi.org/10.1007/s10994-007-5016-8

  23. Zeng K, Liu J, Jiang Z, Xu D (2022) A decreasing scaling transition scheme from Adam to SGD. Adv Theory Simul 5(7):2100599. https://doi.org/10.1002/adts.202100599

  24. Jalaian B, Lee M, Russell S (2019) Uncertain context: uncertainty quantification in machine learning. AI Mag 40(4):40–49. https://doi.org/10.1609/aimag.v40i4.4812

  25. Wu X, Wagner P, Huber MF (2023) Quantification of uncertainties in neural networks. In: Shajek A, Hartmann EA (eds) New digital work. Springer, Cham, pp 276–287. https://doi.org/10.1007/978-3-031-26490-0_16

  26. Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) Adabelief optimizer: adapting stepsizes by the belief in observed gradients. In: Advances in neural information processing systems, vol 33, pp 18795–18806. https://proceedings.neurips.cc/paper_files/paper/2020/file/d9d4f495e875a2e075a1a4a6e1b9770f-Paper.pdf

  27. Koçak H (2021) A combined meshfree exponential Rosenbrock integrator for the third-order dispersive partial differential equations. Numer Methods Partial Differ Equ 37(3):2458–2468. https://doi.org/10.1002/num.22726

  28. Oza U, Patel S, Kumar P (2021) Noveme - color space net for image classification. Intell Inf Database Syst 12672:531–543. https://doi.org/10.1007/978-3-030-73280-6_42

  29. Branco A, Carvalheiro C, Costa F, Castro S, Silva J, Martins C, Ramos J (2014) DeepBankPT and companion Portuguese treebanks in a multilingual collection of treebanks aligned with the Penn Treebank. In: Computational processing of the Portuguese language, pp 207–213. https://doi.org/10.1007/978-3-319-09761-9_23

  30. Ma X, Tao Z, Wang Y, Yu H, Wang Y (2015) Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54:187–197. https://doi.org/10.1016/j.trc.2015.03.014

  31. McMahan HB, Streeter MJ (2010) Adaptive bound optimization for online convex optimization. In: Proceedings of the 23rd conference on learning theory (COLT), pp 224–256. https://www.learningtheory.org/colt2010/conference-website/papers/104mcmahan.pdf

Acknowledgements

This work was funded in part by the Natural Science Foundation of Jilin Province (No. YDZJ202201ZYTS519, No. YDZJ202201ZYTS585), and in part by the National Natural Science Foundation of China (No. 62176051).

Author information

Contributions

Wenhan Jiang: writing - original draft preparation, conceptualization, methodology, investigation. Yuqing Liang: writing - original draft preparation, writing - review and editing. Zhixia Jiang: writing - review and editing, supervision. Dongpo Xu: writing - review and editing, conceptualization. Linhua Zhou: conceptualization, validation.

Corresponding authors

Correspondence to Zhixia Jiang or Dongpo Xu.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A   Proof of Theorem 1

Lemma 1

(Lemma 3 [31])  For any \(Q \in S_{d}^{+}\) and convex feasible set \(\mathcal {F} \subset \mathbb {R}^{d}\), suppose \({{u}_{1}}=\arg \min _{x\in \mathcal {F}}\,\left\| {{Q}^{1/2}}\left( x-{{z}_{1}} \right) \right\| \) and \({{u}_{2}}=\arg \min _{x\in \mathcal {F}}\,\left\| {{Q}^{1/2}}\left( x-{{z}_{2}} \right) \right\| \) where \({{z}_{1}},{{z}_{2}}\in {{\mathbb {R}}^{d}}\), then we have \({{\left\| {{Q}^{1/2}}\left( {{u}_{1}}-{{u}_{2}} \right) \right\| }^{2}}\le {{\left\| {{Q}^{1/2}}\left( {{z}_{1}}-{{z}_{2}} \right) \right\| }^{2}}\).

Proof of Theorem 1

By the definition of the projection operation \(\prod _{\mathcal {F},A}\) in Section 2, we have

$$\begin{aligned} {{x}_{t+1}}&=\prod \nolimits _{\mathcal {F},\sqrt{{{V}_{t}}}}{\left( {{x}_{t}}-{{\alpha }_{t}}V_{t}^{-1/2}{{m}_{t}} \right) } \nonumber \\&=\arg \underset{x\in \mathcal {F}}{\mathop {\min }}\,\left\| V_{t}^{1/4}\left[ x-\left( {{x}_{t}}-{{\alpha }_{t}}V_{t}^{-1/2}{{m}_{t}} \right) \right] \right\| . \end{aligned}$$
(A1)

Since \({{x}^{*}}\in \mathcal {F}\), we then know that \(\prod \nolimits _{\mathcal {F},\sqrt{{{V}_{t}}}}{\left( {{x}^{*}} \right) }={{x}^{*}}\). Thus, using Lemma 1 with \({{u}_{1}}={{x}_{t+1}}\), \({{u}_{2}}={{x}^{*}}\) and \(Q=\sqrt{{{V}_{t}}}\), we arrive at

$$\begin{aligned} \left\Vert V_{t}^{1/4}\left( x_{t+1}-x^{*} \right) \right\Vert ^{2}&\le \left\Vert V_{t}^{1/4}\left( x_{t}-\alpha _{t}V_{t}^{-1/2}m_{t}-x^{*} \right) \right\Vert ^{2} \\&=\left\Vert V_{t}^{1/4}\left( x_{t}-x^{*} \right) \right\Vert ^{2}+\alpha _{t}^{2}\left\Vert V_{t}^{-1/4}m_{t} \right\Vert ^{2}-2\alpha _{t}\left\langle m_{t},x_{t}-x^{*} \right\rangle \\&=\left\Vert V_{t}^{1/4}\left( x_{t}-x^{*} \right) \right\Vert ^{2}+\alpha _{t}^{2}\left\Vert V_{t}^{-1/4}m_{t} \right\Vert ^{2} \\&\quad -2\alpha _{t}\beta _{1t}\left\langle m_{t-1},x_{t}-x^{*}\right\rangle -2\alpha _{t}(1-\beta _{1t})\left\langle g_{t},x_{t}-x^{*}\right\rangle , \end{aligned}$$
(A2)

where the last equality follows from the update of \(m_{t}\) in Algorithm 2, i.e., \(m_{t}=\beta _{1t}m_{t-1}+(1-\beta _{1t})g_{t}\). Upon rearranging (A2), we have

$$\begin{aligned} 2\alpha _{t}(1-\beta _{1t}) \langle g_{t},x_{t}-x^{*}\rangle&\le {{\left\| V_{t}^{1/4}\left( {{x}_{t}}-{{x}^{*}} \right) \right\| }^{2}} -{{\left\| V_{t}^{1/4}\left( {{x}_{t+1}}-{{x}^{*}} \right) \right\| }^{2}} \nonumber \\&\quad +\alpha _{t}^{2}{{\left\| V_{t}^{-1/4}{{m}_{t}} \right\| }^{2}}\! -2\alpha _{t}\beta _{1t} \langle m_{t-1},x_{t}\!-\!x^{*}\rangle . \end{aligned}$$
(A3)

Since \(0\le \beta _{1t}<1\), dividing both sides of (A3) by \(2\alpha _{t}(1-\beta _{1t})\) gives

$$\begin{aligned} \left\langle {{g}_{t}},{{x}_{t}}-{{x}^{*}} \right\rangle&\le \frac{{{\alpha }_{t}}}{2\left( 1-{{\beta }_{1t}} \right) }{{\left\| V_{t}^{-1/4}{{m}_{t}} \right\| }^{2}}-\frac{{{\beta }_{1t}}}{1-{{\beta }_{1t}}}\left\langle {{m}_{t-1}},{{x}_{t}}-{{x}^{*}} \right\rangle \nonumber \\&\quad + \underbrace{\frac{1}{2{{\alpha }_{t}}\left( 1-{{\beta }_{1t}} \right) }\left( {{\left\| V_{t}^{1/4}\left( {{x}_{t}}\!-\!{{x}^{*}} \right) \right\| }^{2}}\!\!-\!{{\left\| V_{t}^{1/4}\left( {{x}_{t+1}}\!-\!{{x}^{*}} \right) \right\| }^{2}} \right) }_{\Theta _{t}}, \end{aligned}$$
(A4)

where \(\Theta _{t}\) is introduced to keep the proof readable. Applying Young's inequality \(\langle a,b\rangle \le \frac{\epsilon }{2}\Vert a\Vert ^{2}+\frac{1}{2\epsilon }\Vert b\Vert ^{2}\) with \(a=-V_{t}^{-1/4}m_{t-1}\), \(b=V_{t}^{1/4}(x_{t}-x^{*})\) and \(\epsilon =\alpha _{t}\), the inner product in (A4) becomes

$$\begin{aligned}&-\langle m_{t-1},x_{t}-x^{*}\rangle \le \dfrac{\alpha _{t}}{2} \left\Vert V_{t}^{-1/4}m_{t-1}\right\Vert ^{2} \nonumber \\&+\dfrac{1}{2\alpha _{t}} \left\Vert V_{t}^{1/4}(x_{t}-x^{*})\right\Vert ^{2}. \end{aligned}$$
(A5)

Now, substituting (A5) in (A4), we have

$$\begin{aligned} \langle g_{t},x_{t}-x^{*}\rangle&\le \dfrac{\alpha _{t}}{2(1-\beta _{1t})} \left\Vert V_{t}^{-1/4}m_{t}\right\Vert ^{2} +\Theta _{t} \nonumber \\&\quad +\dfrac{\beta _{1t}}{1-\beta _{1t}} \left( \dfrac{\alpha _{t}}{2} \left\Vert V_{t}^{-1/4}m_{t-1}\right\Vert ^{2} +\dfrac{1}{2\alpha _{t}} \left\Vert V_{t}^{1/4}(x_{t}-x^{*})\right\Vert ^{2} \right) \nonumber \\&= \underbrace{ \dfrac{\beta _{1t}}{2\alpha _{t}(1-\beta _{1t})} \left\Vert V_{t}^{1/4}(x_{t}-x^{*})\right\Vert ^{2}}_{\Upsilon _{t}} +\Theta _{t} \nonumber \\&\quad +\underbrace{\dfrac{\alpha _{t}}{2(1-\beta _{1t})} \left\Vert V_{t}^{-1/4}m_{t}\right\Vert ^{2} +\dfrac{\alpha _{t}\beta _{1t}}{2(1-\beta _{1t})} \left\Vert V_{t}^{-1/4}m_{t-1}\right\Vert ^{2}}_{\Lambda _{t}}. \end{aligned}$$
(A6)

Summing over t from 1 to T, we obtain

$$\begin{aligned} \sum _{t=1}^{T} \langle g_{t},x_{t}-x^{*}\rangle \le \sum _{t=1}^{T}\Theta _{t} +\sum _{t=1}^{T}\Lambda _{t} +\sum _{t=1}^{T}\Upsilon _{t}, \end{aligned}$$
(A7)

where \(\Theta _{t}\) is defined in (A4), and \(\Lambda _{t}\) and \(\Upsilon _{t}\) are defined in (A6).

Next, we estimate the terms \(\Theta _{t}\), \(\Lambda _{t}\) and \(\Upsilon _{t}\) in (A7). For simplicity, set \(\widetilde{v}_{t}=\hat{v}_{t}+\epsilon \), so that \(V_{t}=\text {diag}(\hat{v}_{t}+\epsilon )=\text {diag}(\widetilde{v}_{t})\) in Algorithm 2. Thus

$$\begin{aligned} \sum _{t=1}^{T}\Theta _{t}&\overset{\text {(A4)}}{=} \sum _{t=1}^{T} \dfrac{1}{2\alpha _{t}(1-\beta _{1t})} \left( \left\Vert V_{t}^{1/4}(x_{t}-x^{*})\right\Vert ^{2}-\left\Vert V_{t}^{1/4}(x_{t+1}-x^{*})\right\Vert ^{2}\right) \nonumber \\&\,=\,\dfrac{1}{2} \sum _{t=1}^{T} \sum _{i=1}^{d} \dfrac{1}{\alpha _{t}(1-\beta _{1t})} \left( \widetilde{v}_{t,i}^{1/2}(x_{t,i}-x_{i}^{*})^{2}- \widetilde{v}_{t,i}^{1/2}(x_{t+1,i}-x_{i}^{*})^{2}\right) \nonumber \\&\,=\,\dfrac{1}{2}\sum _{t=1}^{T}\sum _{i=1}^{d}\dfrac{\widetilde{v}_{t,i}^{1/2}}{\alpha _{t}(1-\beta _{1t})} \left( (x_{t,i}-x_{i}^{*})^{2}- (x_{t+1,i}-x_{i}^{*})^{2}\right) \nonumber \\&\,=\,\dfrac{1}{2\alpha _{1}(1-\beta _{11})}\sum _{i=1}^{d}\widetilde{v}_{1,i}^{1/2}(x_{1,i}-x_{i}^{*})^{2}-\dfrac{1}{2\alpha _{T}(1-\beta _{1T})} \nonumber \\&\,\quad \sum _{i=1}^{d}\widetilde{v}_{T,i}^{1/2} (x_{T+1,i}-x_{i}^{*})^{2} \nonumber \\&\,\quad +\dfrac{1}{2}\sum _{t=2}^{T}\sum _{i=1}^{d}(x_{t,i}-x_{i}^{*})^{2}\left( \dfrac{\widetilde{v}_{t,i}^{1/2}}{\alpha _{t}(1-\beta _{1t})}-\dfrac{\widetilde{v}_{t-1,i}^{1/2}}{\alpha _{t-1}(1-\beta _{1(t-1)})}\right) \nonumber \\&\,\le \, \dfrac{D_{\infty }^{2}}{2\alpha _{1}(1-\beta _{1})}\sum _{i=1}^{d}\widetilde{v}_{1,i}^{1/2}\!+\!\dfrac{1}{2}\sum _{t=2}^{T}\sum _{i=1}^{d}\dfrac{(x_{t,i}\!-\!x_{i}^{*})^{2}}{1-\beta _{1t}} \!\left( \dfrac{\widetilde{v}_{t,i}^{1/2}}{\alpha _{t}}\!-\!\dfrac{\widetilde{v}_{t-1,i}^{1/2}}{\alpha _{t-1}}\!\right) , \end{aligned}$$
(A8)

where the first inequality holds by \(\beta _{11}=\beta _{1}\) and the condition \(\beta _{1t}\le \beta _{1(t-1)}\) (so that \(-\frac{1}{1-\beta _{1(t-1)}}\le -\frac{1}{1-\beta _{1t}}\)), and the last inequality follows from Assumption 1, which gives \((x_{1,i}-x^{*}_{i})^{2}\le \Vert x_{1}-x^{*}\Vert _{\infty }^{2}\le D_{\infty }^{2}\). Next, it follows from the update of \(\hat{v}_{t}\) in Algorithm 2, i.e., \(\hat{v}_{t}=\max \{\hat{v}_{t-1},v_{t}\}\), that \(\hat{v}_{t}\ge \hat{v}_{t-1}\). Then, since \(\widetilde{v}_{t}=\hat{v}_{t}+\epsilon \), we get \(\widetilde{v}_{t}\ge \widetilde{v}_{t-1}\). Thus, with the stepsize condition \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\), we have

$$\begin{aligned} \dfrac{\widetilde{v}_{t,i}^{1/2}}{\alpha _{t}} -\dfrac{\widetilde{v}_{t-1,i}^{1/2}}{\alpha _{t-1}} =\dfrac{\sqrt{t}\,\widetilde{v}_{t,i}^{1/2}}{\alpha } -\dfrac{\sqrt{t-1}\,\widetilde{v}_{t-1,i}^{1/2}}{\alpha } \ge 0. \end{aligned}$$
(A9)

Thus, using the condition that \(\beta _{1t}\le \beta _{1}\), we can further estimate (A8) as follows.

$$\begin{aligned} \sum _{t=1}^{T}\Theta _{t}&\le \dfrac{D_{\infty }^{2}}{2\alpha _{1}(1-\beta _{1})}\sum _{i=1}^{d}\widetilde{v}_{1,i}^{1/2}+\dfrac{1}{2(1-\beta _{1})}\sum _{t=2}^{T}\sum _{i=1}^{d}(x_{t,i}-x_{i}^{*})^{2}\left( \dfrac{\widetilde{v}_{t,i}^{1/2}}{\alpha _{t}}-\dfrac{\widetilde{v}_{t-1,i}^{1/2}}{\alpha _{t-1}}\right) \nonumber \\&\le \dfrac{D_{\infty }^{2}}{2\alpha _{1}(1-\beta _{1})}\sum _{i=1}^{d}\widetilde{v}_{1,i}^{1/2}+\dfrac{D_{\infty }^{2}}{2(1-\beta _{1})}\sum _{t=2}^{T}\sum _{i=1}^{d}\left( \dfrac{\widetilde{v}_{t,i}^{1/2}}{\alpha _{t}}-\dfrac{\widetilde{v}_{t-1,i}^{1/2}}{\alpha _{t-1}}\right) \nonumber \\&=\dfrac{D_{\infty }^{2}}{2(1-\beta _{1})}\sum _{i=1}^{d}\dfrac{\widetilde{v}_{T,i}^{1/2}}{\alpha _{T}} =\dfrac{D_{\infty }^{2}\sqrt{T}}{2\alpha (1-\beta _{1})}\sum _{i=1}^{d}\widetilde{v}_{T,i}^{1/2}, \end{aligned}$$
(A10)

where the second inequality is due to \((x_{t,i}-x^{*}_{i})^{2}\le \Vert x_{t}-x^{*}\Vert _{\infty }^{2}\le D_{\infty }^{2}\) assumed in Assumption 1, and the last equality holds by the stepsize condition that \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\).

By the definition of \(\Lambda _{t}\) in (A6), we have

$$\begin{aligned} \sum _{t=1}^{T}\Lambda _{t}&=\sum _{t=1}^{T}\left( \dfrac{\alpha _{t}}{2(1-\beta _{1t})}\left\Vert V_{t}^{-1/4}m_{t}\right\Vert ^{2}+\dfrac{\alpha _{t}\beta _{1t}}{2(1-\beta _{1t})}\left\Vert V_{t}^{-1/4}m_{t-1}\right\Vert ^{2}\right) \nonumber \\&\le \dfrac{1}{2(1-\beta _{1})}\sum _{t=1}^{T}\alpha _{t}\left\Vert V_{t}^{-1/4}m_{t}\right\Vert ^{2}+\dfrac{1}{2(1-\beta _{1})}\sum _{t=1}^{T}\alpha _{t}\left\Vert V_{t}^{-1/4}m_{t-1}\right\Vert ^{2} \nonumber \\&=\dfrac{1}{2(1-\beta _{1})}\sum _{t=1}^{T}\alpha _{t}\sum _{i=1}^{d}\widetilde{v}_{t,i}^{-1/2}\big (m_{t,i}^{2}+m_{t-1,i}^{2}\big ) \nonumber \\&\le \dfrac{1}{2\sqrt{\epsilon }(1-\beta _{1})}\sum _{t=1}^{T}\alpha _{t}\sum _{i=1}^{d}\big (m_{t,i}^{2}+m_{t-1,i}^{2}\big ) \nonumber \\&\le \dfrac{1}{2\sqrt{\epsilon }(1-\beta _{1})}\sum _{t=1}^{T}\sum _{i=1}^{d}\big (\alpha _{t}m_{t,i}^{2}+\alpha _{t-1}m_{t-1,i}^{2}\big ) \nonumber \\&\le \dfrac{1}{\sqrt{\epsilon }(1-\beta _{1})}\sum _{t=1}^{T}\sum _{i=1}^{d}\alpha _{t}m_{t,i}^{2}, \end{aligned}$$
(A11)

where the first inequality holds by the conditions that \(\beta _{1t}<1\) and \(\beta _{1t}\le \beta _{1}\), and the second inequality follows from \(\widetilde{v}_{t,i}=\hat{v}_{t,i}+\epsilon \ge \epsilon \), i.e., \(\widetilde{v}_{t,i}^{-1/2}\le \frac{1}{\sqrt{\epsilon }}\) for all \(t\in \left[ T \right] \). The third inequality is due to \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\), so that \(\alpha _{t}\le \alpha _{t-1}\), and the last inequality holds because \(m_{0}=0\), i.e., \(\sum _{t=1}^{T}\alpha _{t-1}m_{t-1,i}^{2}=\sum _{t=1}^{T-1}\alpha _{t}m_{t,i}^{2}\le \sum _{t=1}^{T}\alpha _{t}m_{t,i}^{2}\). Next, from the update of \(m_{t}\) in Algorithm 2, we arrive at

$$\begin{aligned} m_{t,i}^{2}&=\left( \sum _{k=1}^{t} (1-\beta _{1k}) \prod _{j=k+1}^{t} \beta _{1j} g_{k,i}\right) ^{2} \nonumber \\&\le \left( \sum _{k=1}^{t} \beta _{1}^{t-k} g_{k,i}\right) ^{2} \le \left( \sum _{k=1}^{t} \beta _{1}^{t-k}\right) \sum _{k=1}^{t}\beta _{1}^{t-k} g_{k,i}^{2} \nonumber \\&\le \left( \sum _{k=0}^{\infty } \beta _{1}^{k}\right) \sum _{k=1}^{t}\beta _{1}^{t-k} g_{k,i}^{2} \le \dfrac{1}{1-\beta _{1}} \sum _{k=1}^{t}\beta _{1}^{t-k} g_{k,i}^{2}, \end{aligned}$$
(A12)

where the first inequality follows from \(1-\beta _{1k}\le 1\) and \(\beta _{1j}\le \beta _{1}\) for any \(j\ge 1\), the second inequality holds by Jensen's inequality applied to the convex function \(x^{2}\), and the last inequality is due to the fact that \(\beta _{1}<1\), so \(\sum _{k=0}^{\infty }\beta _{1}^{k}\le \frac{1}{1-\beta _{1}}\). Now, plugging (A12) into (A11) gives

$$\begin{aligned} \sum _{t=1}^{T}\Lambda _{t}&\le \dfrac{1}{\sqrt{\epsilon }(1-\beta _{1})} \sum _{t=1}^{T} \sum _{i=1}^{d} \alpha _{t}\cdot \dfrac{1}{1-\beta _{1}} \sum _{k=1}^{t}\beta _{1}^{t-k} g_{k,i}^{2} \nonumber \\&\le \dfrac{\alpha }{\sqrt{\epsilon }(1-\beta _{1})^{2}} \sum _{k=1}^{T}\sum _{t=k}^{T} \dfrac{\beta _{1}^{t-k}}{\sqrt{k}} \Vert g_{k}\Vert ^{2} \nonumber \\&\le \dfrac{\alpha }{\sqrt{\epsilon }(1-\beta _{1})^{2}} \sum _{k=1}^{T} \dfrac{\beta _{1}^{-k}}{\sqrt{k}} \Vert g_{k}\Vert ^{2} \sum _{t=k}^{\infty } \beta _{1}^{t} \nonumber \\&\le \dfrac{\alpha }{\sqrt{\epsilon }(1-\beta _{1})^{3}} \sum _{k=1}^{T} \dfrac{\Vert g_{k}\Vert ^{2}}{\sqrt{k}}, \end{aligned}$$
(A13)

where the second inequality follows from exchanging the order of summation and the fact that \(\sqrt{k}\le \sqrt{t}\) for any \(t\ge k\). The last two inequalities hold by \(\beta _{1}\ge 0\), so that \(\sum _{t=k}^{T}\beta _{1}^{t}\le \sum _{t=k}^{\infty }\beta _{1}^{t}\le \frac{\beta _{1}^{k}}{1-\beta _{1}}\). Further, applying the bounded gradient condition in Assumption 2, i.e., \(\Vert g_{t}\Vert \le G\), we obtain

$$\begin{aligned} \sum _{k=1}^{T} \dfrac{\Vert g_{k}\Vert ^{2}}{\sqrt{k}} \le G^{2}\sum _{k=1}^{T}\dfrac{1}{\sqrt{k}} \le 2G^{2}\sqrt{T}, \end{aligned}$$
(A14)

where the last inequality follows from

$$\begin{aligned} \sum _{k=2}^{T} \dfrac{1}{\sqrt{k}} \le \int _{1}^{T}\dfrac{1}{\sqrt{x}}\,dx \le 2\sqrt{T}-1. \end{aligned}$$
(A15)

Combining (A13) and (A14), we have

$$\begin{aligned} \sum _{t=1}^{T}\Lambda _{t} \le \dfrac{\alpha }{\sqrt{\epsilon }(1-\beta _{1})^{3}}\, 2G^{2}\sqrt{T} =\dfrac{2\alpha G^{2}\sqrt{T}}{\sqrt{\epsilon }(1-\beta _{1})^{3}}. \end{aligned}$$
(A16)

From the definition of \(\Upsilon _{t}\) in (A6), we have

$$\begin{aligned} \sum _{t=1}^{T}\Upsilon _{t}&=\sum _{t=1}^{T} \dfrac{\beta _{1t}}{2\alpha _{t}(1-\beta _{1t})} \left\Vert V_{t}^{1/4}(x_{t}-x^{*})\right\Vert ^{2} \nonumber \\&\le \dfrac{1}{2(1-\beta _{1})} \sum _{t=1}^{T}\dfrac{\beta _{1t}}{\alpha _{t}} \left\Vert V_{t}^{1/4}(x_{t}-x^{*})\right\Vert ^{2} \nonumber \\&\le \dfrac{D_{\infty }}{2(1-\beta _{1})} \sum _{t=1}^{T}\dfrac{\beta _{1t}}{\alpha _{t}} \sum _{i=1}^{d}\widetilde{v}_{t,i}^{1/2}, \end{aligned}$$
(A17)

where the first inequality holds by the condition that \(\beta _{1t}\le \beta _{1}\) and the last inequality is due to Assumption 1.

Finally, substituting the bounds of \(\Theta _{t}\) in (A10), \(\Lambda _{t}\) in (A16) and \(\Upsilon _{t}\) in (A17) to (A7), we arrive at

$$\begin{aligned} \sum _{t=1}^{T}\langle g_{t},x_{t}-x^{*}\rangle&\overset{\text {(A7)}}{\le }\sum _{t=1}^{T}\Theta _{t}+\sum _{t=1}^{T}\Lambda _{t}+\sum _{t=1}^{T}\Upsilon _{t} \nonumber \\&\;\,\le \;\, \dfrac{D_{\infty }^{2}\sqrt{T}}{2\alpha (1-\beta _{1})}\sum _{i=1}^{d}\widetilde{v}_{T,i}^{1/2}+\dfrac{2\alpha G^{2}\sqrt{T}}{\sqrt{\epsilon }(1-\beta _{1})^{3}} \nonumber \\&\qquad +\dfrac{D_{\infty }}{2(1-\beta _{1})}\sum _{t=1}^{T}\dfrac{\beta _{1t}}{\alpha _{t}}\sum _{i=1}^{d}\widetilde{v}_{t,i}^{1/2}. \end{aligned}$$
(A18)

Since \({{v}_{t}}=\frac{{{v}_{t-1}}+\left( 1-{{\beta }_{2}} \right) \left| g_{t}^{2}-{{v}_{t-1}} \right| }{{{\left\| {{v}_{t-1}}+\left( 1-{{\beta }_{2}} \right) \left| g_{t}^{2}-{{v}_{t-1}} \right| \right\| }_{p}}}\) by the normalization step in Algorithm 2, we have \(\Vert v_{t}\Vert _{p}=1\), and hence \(v_{t,i}\le 1\) for every component \(i\). By the update of \(\hat{v}_{t}\), i.e., \(\hat{v}_{t}=\max \{\hat{v}_{t-1},v_{t}\}\) with \(\hat{v}_{0}=0\), it follows that \(\hat{v}_{t,i}\le 1\) for all \(i\). By the definition of \(\widetilde{v}_{t}\), we have

$$\begin{aligned} \sum _{i=1}^{d} \widetilde{v}_{t,i}^{1/2}&=\sum _{i=1}^{d}\sqrt{\hat{v}_{t,i}+\epsilon } \le \sum _{i=1}^{d}\sqrt{\hat{v}_{t,i}} +\sum _{i=1}^{d}\sqrt{\epsilon } \nonumber \\&\le d+d\sqrt{\epsilon } =(1+\sqrt{\epsilon })d. \end{aligned}$$
(A19)

Combining (A18) and (A19) gives

$$\begin{aligned} \sum _{t=1}^{T}\langle g_{t},x_{t}-x^{*}\rangle&\le \dfrac{D_{\infty }^{2}(1+\sqrt{\epsilon })d\sqrt{T}}{2\alpha (1-\beta _{1})}+\dfrac{2\alpha G^{2}\sqrt{T}}{\sqrt{\epsilon }(1-\beta _{1})^{3}} \nonumber \\&+\dfrac{D_{\infty }(1+\sqrt{\epsilon })d}{2(1-\beta _{1})}\sum _{t=1}^{T}\dfrac{\beta _{1t}}{\alpha _{t}} \nonumber \\&=\left( \dfrac{D_{\infty }^{2}(1+\sqrt{\epsilon })d}{2\alpha (1-\beta _{1})}+\dfrac{2\alpha G^{2}}{\sqrt{\epsilon }(1-\beta _{1})^{3}}\right) \sqrt{T} \nonumber \\&+\dfrac{D_{\infty }(1+\sqrt{\epsilon })d}{2(1-\beta _{1})}\sum _{t=1}^{T}\dfrac{\beta _{1t}}{\alpha _{t}}. \end{aligned}$$
(A20)

Applying the convexity of the function \(f_{t}\), we know that

$$\begin{aligned}&\sum _{t=1}^{T}\big (f_{t}(x_{t})-f_{t}(x^{*})\big )\le \sum _{t=1}^{T}\langle g_{t},x_{t}-x^{*}\rangle \nonumber \\&\le \left( \dfrac{D_{\infty }^{2}(1+\sqrt{\epsilon })d}{2\alpha (1-\beta _{1})}\!+\!\dfrac{2\alpha G^{2}}{\sqrt{\epsilon }(1-\beta _{1})^{3}}\right) \sqrt{T} \nonumber \\&+\dfrac{D_{\infty }(1+\sqrt{\epsilon })d}{2(1-\beta _{1})}\sum _{t=1}^{T}\dfrac{\beta _{1t}}{\alpha _{t}}. \end{aligned}$$
(A21)

This completes the proof. \(\square \)
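
The normalization property invoked at (A19) is easy to check numerically. The snippet below (illustrative only, not part of the proof; the random surrogate gradients and constants are ours) verifies that the normalized second moment has unit \(p\)-norm, that every component of \(\hat{v}_{t}\) stays in \([0,1]\), and that \(\sum _{i=1}^{d}\widetilde{v}_{t,i}^{1/2}\le (1+\sqrt{\epsilon })d\).

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, beta2, eps, p = 10, 500, 0.999, 1e-8, 2

v = np.zeros(d)
v_hat = np.zeros(d)
for t in range(1, T + 1):
    g = rng.normal(size=d)                    # surrogate stochastic gradient
    u = v + (1 - beta2) * np.abs(g**2 - v)    # absolute-value increment
    v = u / np.linalg.norm(u, ord=p)          # now ||v_t||_p = 1
    v_hat = np.maximum(v_hat, v)              # componentwise max stays <= 1
    assert np.isclose(np.linalg.norm(v, ord=p), 1.0)
    assert np.all(v_hat <= 1.0 + 1e-12)
    assert np.sqrt(v_hat + eps).sum() <= (1 + np.sqrt(eps)) * d
print("normalization checks passed")
```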

Appendix B   Proof of Corollary 1

Proof of Corollary 1

By the conditions that \({{\beta }_{1t}}={{\beta }_{1}}{{\lambda }^{t}}\) and \(0<\lambda <1\), we have

$$\begin{aligned} \begin{aligned} \sum \limits _{t=1}^{T}{\frac{{{\beta }_{1t}}}{{{\alpha }_{t}}}}&=\sum \limits _{t=1}^{T}{{{\beta }_{1}}{{\lambda }^{t}}\frac{\sqrt{t}}{\alpha }}\le \frac{{{\beta }_{1}}}{\alpha }\sum \limits _{t=1}^{T}{{{\lambda }^{t-1}}\sqrt{t}}\le \frac{{{\beta }_{1}}}{\alpha }\sum \limits _{t=1}^{T}{{{\lambda }^{t-1}}t} \\&=\frac{{{\beta }_{1}}}{\alpha }\left( \frac{1-{{\lambda }^{T}}}{{{\left( 1-\lambda \right) }^{2}}}-\frac{T{{\lambda }^{T}}}{1-\lambda } \right) \le \frac{{{\beta }_{1}}}{\alpha {{\left( 1-\lambda \right) }^{2}}}. \\ \end{aligned} \end{aligned}$$
(B1)

From Theorem 1, it is clear that

$$\begin{aligned} \begin{aligned}&\;\;\sum \limits _{t=1}^{T}{\big ( {{f}_{t}}\left( {{x}_{t}} \right) -{{f}_{t}}\left( {{x}^{*}} \right) \big )}\\&\;\le \; \left( \dfrac{D_{\infty }^{2}(1+\sqrt{\epsilon })d}{2\alpha (1-\beta _{1})}+\dfrac{2\alpha G^{2}}{\sqrt{\epsilon }(1-\beta _{1})^{3}}\right) \sqrt{T} \\&\quad +\dfrac{D_{\infty }(1+\sqrt{\epsilon })d}{2(1-\beta _{1})}\sum _{t=1}^{T}\dfrac{\beta _{1t}}{\alpha _{t}}\\&\overset{\text {(B1)}}{\le }\left( \dfrac{D_{\infty }^{2}(1+\sqrt{\epsilon })d}{2\alpha (1-\beta _{1})}+\dfrac{2\alpha G^{2}}{\sqrt{\epsilon }(1-\beta _{1})^{3}}\right) \sqrt{T} \\&\quad +\dfrac{D_{\infty }(1+\sqrt{\epsilon })d}{2(1-\beta _{1})}\frac{{{\beta }_{1}}}{\alpha {{\left( 1-\lambda \right) }^{2}}}.\\ \end{aligned} \end{aligned}$$
(B2)

This completes the proof. \(\square \)
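
The geometric-series bound (B1) can likewise be checked numerically for sample values (illustrative only; the constants below are arbitrary choices of ours):

```python
import numpy as np

# (B1): sum_{t=1}^T beta1 * lam**t * sqrt(t) / alpha <= beta1 / (alpha * (1 - lam)**2)
beta1, lam, alpha, T = 0.9, 0.99, 0.001, 100_000
t = np.arange(1, T + 1)
lhs = (beta1 / alpha) * np.sum(lam**t * np.sqrt(t))
rhs = beta1 / (alpha * (1 - lam)**2)
assert lhs <= rhs
print(f"{lhs:.4e} <= {rhs:.4e}")
```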

Appendix C   Proof of Corollary 2

Proof of Corollary 2

Since \({{\beta }_{1t}}={{\beta }_{1}}/t\) and \(\alpha _{t}=\alpha /\sqrt{t}\), we have

$$\begin{aligned} \begin{aligned} \sum \limits _{t=1}^{T} {\frac{{{\beta }_{1t}}}{{{\alpha }_{t}}}} =\frac{{{\beta }_{1}}}{\alpha } \sum \limits _{t=1}^{T}{\frac{\sqrt{t}}{t}} =\frac{{{\beta }_{1}}}{\alpha } \sum \limits _{t=1}^{T}{\frac{1}{\sqrt{t}}} \le \frac{2{{\beta }_{1}}\sqrt{T}}{\alpha }, \end{aligned} \end{aligned}$$
(C1)

where the last inequality is due to

$$\begin{aligned} \sum \limits _{t=1}^{T}{\frac{1}{\sqrt{t}}} =1+\sum \limits _{t=2}^{T}{\frac{1}{\sqrt{t}}}\le 1+\int _{1}^{T}{\frac{1}{\sqrt{t}}dt} \le 2\sqrt{T}. \end{aligned}$$
(C2)

From Theorem 1, it is clear that

$$\begin{aligned} \begin{aligned}&\;\;\sum \limits _{t=1}^{T}{\left( {{f}_{t}}\left( {{x}_{t}} \right) -{{f}_{t}}\left( {{x}^{*}} \right) \right) }\\&\;\le \; \left( \dfrac{D_{\infty }^{2}(1+\sqrt{\epsilon })d}{2\alpha (1-\beta _{1})}+\dfrac{2\alpha G^{2}}{\sqrt{\epsilon }(1-\beta _{1})^{3}}\right) \sqrt{T} \\&\quad +\dfrac{D_{\infty }(1+\sqrt{\epsilon })d}{2(1-\beta _{1})}\sum _{t=1}^{T}\dfrac{\beta _{1t}}{\alpha _{t}}\\&\overset{\text {(C1)}}{\le } \left( \dfrac{D_{\infty }^{2}(1+\sqrt{\epsilon })d}{2\alpha (1-\beta _{1})}+\dfrac{2\alpha G^{2}}{\sqrt{\epsilon }(1-\beta _{1})^{3}}\right) \sqrt{T} \\&\quad +\dfrac{D_{\infty }(1+\sqrt{\epsilon })d}{2(1-\beta _{1})}\frac{2{{\beta }_{1}}\sqrt{T}}{\alpha }\\&\;=\;\left( \dfrac{D_{\infty }(1+\sqrt{\epsilon })d(D_{\infty }+2\beta _{1})}{2\alpha (1-\beta _{1})}+\dfrac{2\alpha G^{2}}{\sqrt{\epsilon }(1-\beta _{1})^{3}}\right) \sqrt{T}, \\ \end{aligned} \end{aligned}$$
(C3)

This completes the proof. \(\square \)
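
Similarly, (C1) together with the integral bound (C2) can be checked numerically (illustrative only; constants are ours):

```python
import numpy as np

# (C1): (beta1 / alpha) * sum_{t=1}^T 1/sqrt(t) <= 2 * beta1 * sqrt(T) / alpha
beta1, alpha, T = 0.9, 0.001, 100_000
t = np.arange(1, T + 1)
lhs = (beta1 / alpha) * np.sum(1.0 / np.sqrt(t))
rhs = 2 * beta1 * np.sqrt(T) / alpha
assert lhs <= rhs
print(f"{lhs:.4e} <= {rhs:.4e}")
```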

Appendix D   Algorithms

This appendix lists the algorithms compared in the paper: AdaGrad, AdaGradN, AdaBound, ABNBound, AdamW, ABNAdamW, RAdam, and ABNRAdam. Each ABN variant applies the paper's absolute-value and normalization techniques to its base optimizer; a sketch of this shared pattern is given immediately below.
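
As a rough illustration of that shared pattern (a sketch under our reading of the paper, not the authors' exact pseudocode), the snippet below contrasts plain AdaGrad with an AdaGradN-style step: AdaGrad's accumulator is already monotone, so only the normalization is added before it scales the update. See the repository linked in the abstract for the exact variants.

```python
import numpy as np

def adagrad_step(x, g, v, alpha=0.01, eps=1e-8):
    # Standard AdaGrad: accumulate squared gradients.
    v = v + g**2
    x = x - alpha * g / (np.sqrt(v) + eps)
    return x, v

def adagradn_step(x, g, v, alpha=0.01, eps=1e-8, p=2):
    # Hypothetical AdaGradN-style step: normalize the (already
    # monotone) accumulator to unit p-norm before scaling.
    v = v + g**2
    v_norm = v / max(np.linalg.norm(v.ravel(), ord=p), 1e-16)
    x = x - alpha * g / (np.sqrt(v_norm) + eps)
    return x, v
```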

Algorithm 3: AdaGrad
Algorithm 4: AdaGradN
Algorithm 5: AdaBound
Algorithm 6: ABNBound
Algorithm 7: AdamW
Algorithm 8: ABNAdamW
Algorithm 9: RAdam
Algorithm 10: ABNRAdam

(The corresponding pseudocode figures are not reproduced here.)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Jiang, W., Liang, Y., Jiang, Z. et al. ABNGrad: adaptive step size gradient descent for optimizing neural networks. Appl Intell 54, 2361–2378 (2024). https://doi.org/10.1007/s10489-024-05303-6

