Abstract
Stochastic adaptive gradient descent algorithms, such as AdaGrad and Adam, are extensively used to train deep neural networks. However, randomly sampled gradient information introduces instability into the learning rates, which can cause adaptive methods to generalize poorly. To address this issue, we propose the ABNGrad algorithm, which leverages an absolute value operation and a normalization technique. More specifically, the absolute value function is first incorporated into the iteration of the second-order moment estimate to ensure that it increases monotonically. The normalization technique is then employed to prevent a rapid decrease in the learning rate. In particular, the techniques used in this paper can also be integrated into other existing adaptive algorithms, such as Adam, AdamW, AdaBound, and RAdam, yielding good performance. Additionally, it is shown that ABNGrad attains the optimal regret bound for online convex optimization problems. Finally, extensive experimental results illustrate the effectiveness of ABNGrad. For a comprehensive exploration of the advantages of the proposed approach and the details of its implementation, readers are referred to https://github.com/Wenhan-Jiang/ABNGrad.git
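As a rough illustration of the two ingredients described above — the absolute-value iterate for the second-order moment estimate and the normalization that keeps the learning rate from collapsing — the following NumPy sketch implements one plausible ABNGrad-style update. The function name, the default hyperparameters, and the unconstrained step are our assumptions; the paper's Algorithm 2 additionally projects the iterate onto the feasible set.

```python
import numpy as np

def abngrad_step(x, m, v, v_hat, grad, t, alpha=1e-3, beta1=0.9,
                 beta2=0.999, eps=1e-8, p=2):
    """One ABNGrad-style update (a sketch, not the paper's exact code).

    The absolute value keeps the raw second-moment surrogate moving toward
    |g_t^2 - v_{t-1}| monotonically, and the p-norm normalization enforces
    ||v_t||_p = 1 so the effective step size cannot shrink too quickly.
    """
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    raw = v + (1 - beta2) * np.abs(grad**2 - v)   # absolute-value iterate
    v = raw / np.linalg.norm(raw, ord=p)          # normalize: ||v_t||_p = 1
    v_hat = np.maximum(v_hat, v)                  # monotone max, AMSGrad-style
    step = alpha / np.sqrt(t)                     # alpha_t = alpha / sqrt(t)
    x = x - step * m / np.sqrt(v_hat + eps)       # diagonally preconditioned step
    return x, m, v, v_hat
```

Because the second-moment vector is renormalized at every iteration, \(\Vert v_{t}\Vert _{p}=1\) holds by construction, which is the property used in the regret analysis of Appendix A.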
Data Availability
The data that support the findings of this study are openly available in CIFAR-10 at https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, in CIFAR-100 at https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz, in CINIC-10 at https://paperswithcode.com/sota/image-classification-on-cinic-10, and in VOC at http://host.robots.ox.ac.uk/pascal/VOC/. The authors confirm that the PTB dataset is available in the article The Penn Treebank: An Overview.
References
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, (NAACL), Minneapolis, Minnesota, June, vol 1, pp 4171–4186. https://doi.org/10.18653/v1/N19-1423. https://aclanthology.org/N19-1423
Dai Z, Yang Z, Yang Y, Carbonell J, Le Q, Salakhutdinov R (2019) Transformer-XL: attentive language models beyond a fixed-length context. In: Proceedings of the 57th annual meeting of the association for computational linguistics (ACL), pp 2978–2988. https://doi.org/10.18653/v1/P19-1285
Zhang T, Chen S, Wulamu A, Guo X, Li Q, Zheng H (2023) Transg-net: transformer and graph neural network based multi-modal data fusion network for molecular properties prediction. Appl Intell 53:16077–16088. https://doi.org/10.1007/s10489-022-04351-0
Kononov E, Tashkinov M, Silberschmidt VV (2023) Reconstruction of 3d random media from 2d images: generative adversarial learning approach. Comput Aided Des 158:103498. https://doi.org/10.1016/j.cad.2023.103498
Mathis A, Mamidanna P, Cury KM et al (2018) Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nat Neurosci 21(9):1281–1289. https://doi.org/10.1038/s41593-018-0209-y
Huang B, Zhang S, Huang J, Yu Y, Shi Z, Xiong Y (2022) Knowledge distilled pre-training model for vision-language-navigation. Appl Intell 53:5607–5619. https://doi.org/10.1007/s10489-022-03779-8
Kumar A, Aggarwal RK (2022) An exploration of semi-supervised and language-adversarial transfer learning using hybrid acoustic model for hindi speech recognition. J Reliab Intell Environ 8:117–132. https://doi.org/10.1007/s40860-021-00140-7
Hu L, Fu C, Ren Z et al (2023) Sselm-neg: spherical search-based extreme learning machine for drug-target interaction prediction. BMC Bioinformatics 24:38. https://doi.org/10.1186/s12859-023-05153-y
Xu Y, Verma D, Sheridan RP et al (2020) Deep dive into machine learning models for protein engineering. J Chem Inf Model 60(6):2773–2790. https://doi.org/10.1021/acs.jcim.0c00073
Waring J, Lindvall C, Umeton R (2020) Automated machine learning: review of the state-of-the-art and opportunities for healthcare. Artif Intell Med 104:101822. https://doi.org/10.1016/j.artmed.2020.101822
Wu J, Chen X-Y, Zhang H, Xiong L-D, Lei H, Deng S-H (2019) Hyperparameter optimization for machine learning models based on bayesian optimization. J Electron Sci Technol 17(1):26–40. https://doi.org/10.11989/JEST.1674-862X.80904120
Abbaszadeh Shahri A, Pashamohammadi F, Asheghi R, Abbaszadeh Shahri H (2022) Automated intelligent hybrid computing schemes to predict blasting induced ground vibration. Engineering with Computers 38(4):3335–3349. https://doi.org/10.1007/s00366-021-01444-1
Yuan W, Hu F, Lu L (2022) A new non-adaptive optimization method: stochastic gradient descent with momentum and difference. Appl Intell 52:3939–3953. https://doi.org/10.1007/s10489-021-02224-6
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159. https://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
Yedida R, Aha S, Prashanth T (2021) Lipschitzlr: using theoretically computed adaptive learning rates for fast convergence. Appl Intell 51:1460–1478. https://doi.org/10.1007/s10489-020-01892-0
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd International conference on learning representations, ICLR, San Diego, CA, USA, May, San Diego, CA, USA. http://arxiv.org/abs/1412.6980
Reddi SJ, Kale S, Kumar S (2018) On the convergence of adam and beyond. In: 6th International conference on learning representations, ICLR, Vancouver, BC, Canada, April, Vancouver, BC, Canada. https://openreview.net/forum?id=ryQu7f-RZ
Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. In: 7th International conference on learning representations, ICLR, New Orleans, LA, USA, May 6-9, New Orleans, LA, USA. https://openreview.net/forum?id=Bkg3g2R9FX
Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. In: 7th International conference on learning representations, ICLR, New Orleans, LA, USA, May 6-9. https://openreview.net/forum?id=Bkg6RiCqY7
Liu L, Jiang H, He P, Chen W, Liu X, Gao J, Han J (2020) On the variance of the adaptive learning rate and beyond. In: International conference on learning representations, Ethiopia, July. https://openreview.net/forum?id=rkgz2aEKDr
Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th international conference on machine learning, ICML, Washington, DC, USA, August 21-24, pp 928–936. https://icml.cc/Conferences/2010/papers/473.pdf
Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Mach Learn 69:169–192. https://doi.org/10.1007/s10994-007-5016-8
Zeng K, Liu J, Jiang Z, Xu D (2022) A decreasing scaling transition scheme from adam to sgd. Adv Theory Simul 5(7). https://doi.org/10.1002/adts.202100599
Jalaian B, Lee M, Russell S (2019) Uncertain context: uncertainty quantification in machine learning. AI Mag 40(4):40–49. https://doi.org/10.1609/aimag.v40i4.4812
Wu X, Wagner P, Huber MF (2023) Quantification of uncertainties in neural networks. In: Shajek A, Hartmann EA (eds) New digital work. Springer, Cham, pp 276–287. https://doi.org/10.1007/978-3-031-26490-0_16
Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) Adabelief optimizer: adapting stepsizes by the belief in observed gradients. In: Advances in neural information processing systems, vol 33, pp 18795–18806. https://proceedings.neurips.cc/paper_files/paper/2020/file/d9d4f495e875a2e075a1a4a6e1b9770f-Paper.pdf
Koçak H (2021) A combined meshfree exponential Rosenbrock integrator for the third-order dispersive partial differential equations. Numer Methods Partial Differ Equ 37(3):2458–2468. https://doi.org/10.1002/num.22726
Oza U, Patel S, Kumar P (2021) Noveme - color space net for image classification. Intell Inf Database Syst 12672:531–543. https://doi.org/10.1007/978-3-030-73280-6_42
Branco A, Carvalheiro C, Costa F, Castro S, Silva J, Martins C, Ramos J (2014) Deepbankpt and companion portuguese treebanks in a multilingual collection of treebanks aligned with the penn treebank. In: Computational processing of the Portuguese language, pp 207–213. https://doi.org/10.1007/978-3-319-09761-9_23
Ma X, Tao Z, Wang Y, Yu H, Wang Y (2015) Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transportation Research Part C: Emerging Technologies 54:187–197. https://doi.org/10.1016/j.trc.2015.03.014
McMahan HB, Streeter MJ (2010) Adaptive bound optimization for online convex optimization. In: Proceedings of the 23rd conference on learning theory (COLT), pp 224–256. https://www.learningtheory.org/colt2010/conference-website/papers/104mcmahan.pdf
Acknowledgements
This work was funded in part by the Natural Science Foundation of Jilin Province (No.YDZJ202201ZYTS519, No.YDZJ202201ZYTS585), and in part by the National Natural Science Foundation of China (No.62176051).
Author information
Authors and Affiliations
Contributions
Wenhan Jiang: writing - original draft preparation, conceptualization, methodology, investigation. Yuqing Liang: writing - original draft preparation, writing - review and editing. Zhixia Jiang: writing - review and editing, supervision. Dongpo Xu: writing - review and editing, conceptualization. Linhua Zhou: conceptualization, validation.
Corresponding authors
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A Proof of theorem 1
Lemma 1
(Lemma 3 [31]) For any \(Q \in S_{d}^{+}\) and convex feasible set \(\mathcal {F} \subset \mathbb {R}^{d}\), suppose \({{u}_{1}}=\arg \min _{x\in \mathcal {F}}\,\left\| {{Q}^{1/2}}\left( x-{{z}_{1}} \right) \right\| \) and \({{u}_{2}}=\arg \min _{x\in \mathcal {F}}\,\left\| {{Q}^{1/2}}\left( x-{{z}_{2}} \right) \right\| \) where \({{z}_{1}},{{z}_{2}}\in {{\mathbb {R}}^{d}}\), then we have \({{\left\| {{Q}^{1/2}}\left( {{u}_{1}}-{{u}_{2}} \right) \right\| }^{2}}\le {{\left\| {{Q}^{1/2}}\left( {{z}_{1}}-{{z}_{2}} \right) \right\| }^{2}}\).
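As a quick numerical illustration of Lemma 1 (not part of the original proof), consider a diagonal \(Q\) and the box \(\mathcal {F}=[0,1]^{d}\), for which the \(Q\)-weighted projection decouples into coordinate-wise clipping; the clipped points are then nonexpansive in the weighted norm:

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.uniform(0.5, 2.0, size=4)        # diagonal of Q, positive definite
z1 = rng.normal(size=4) * 3
z2 = rng.normal(size=4) * 3

# For diagonal Q and F = [0, 1]^4, minimizing ||Q^{1/2}(x - z)|| over F
# separates per coordinate into q_i (x_i - z_i)^2, solved by clipping.
u1, u2 = np.clip(z1, 0, 1), np.clip(z2, 0, 1)

def wnorm(x):
    """Weighted norm ||Q^{1/2} x|| for diagonal Q with diagonal q."""
    return np.sqrt(np.sum(q * x**2))

# Lemma 1: the projection is nonexpansive in the Q-weighted norm.
assert wnorm(u1 - u2) <= wnorm(z1 - z2) + 1e-12
```

The inequality holds here because clipping is 1-Lipschitz in each coordinate; the lemma in [31] establishes the same property for general \(Q \in S_{d}^{+}\) and convex \(\mathcal {F}\).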
Proof of Theorem 1
By the definition of the projection operation \(\prod _{\mathcal {F},A}\) in Section 2, we have
Since \({{x}^{*}}\in \mathcal {F}\), we then know that \(\prod \nolimits _{\mathcal {F},\sqrt{{{V}_{t}}}}{\left( {{x}^{*}} \right) }={{x}^{*}}\). Thus, using Lemma 1 with \({{u}_{1}}={{x}_{t+1}}\), \({{u}_{2}}={{x}^{*}}\) and \(Q=\sqrt{{{V}_{t}}}\), we arrive at
where the last equality follows from the iterate of \(m_{t}\) in Algorithm 2, i.e., \(m_{t}=\beta _{1t}m_{t-1}+(1-\beta _{1t})g_{t}\). Rearranging (A2), we have
Since \(0\le \beta _{1t}<1\), dividing both sides of (A3) by \(2\alpha _{t}(1-\beta _{1t})\) gives
where we introduce \(\Theta _{t}\) to keep the proof clear. Applying the inequality \(\langle a,b\rangle \le \frac{\epsilon }{2}\Vert a\Vert ^{2}+\frac{1}{2\epsilon }\Vert b\Vert ^{2}\) with \(a=-V_{t}^{-1/4}m_{t-1}\), \(b=V_{t}^{1/4}(x_{t}-x^{*})\) and \(\epsilon =\alpha _{t}\), the inner product in (A4) becomes
Now, substituting (A5) into (A4), we have
Summing over t from 1 to T, we obtain
where \(\Theta _{t}\), \(\Lambda _{t}\) and \(\Upsilon _{t}\) are defined in (A4) and (A6), respectively.
Next, we estimate the terms \(\Theta _{t}\), \(\Lambda _{t}\) and \(\Upsilon _{t}\) in (A7) as follows. For simplicity, we set \(\widetilde{v}_{t}=\hat{v}_{t}+\epsilon \), so that \(V_{t}=\text {diag}(\hat{v}_{t}+\epsilon )=\text {diag}(\widetilde{v}_{t})\) in Algorithm 2; thus
where the first inequality holds by \(\beta _{11}=\beta _{1}\) and the condition that \(\beta _{1t}\le \beta _{1(t-1)}\), i.e., \(-\frac{1}{1-\beta _{1(t-1)}}\le -\frac{1}{1-\beta _{1t}}\), and the last inequality follows from Assumption 1, which gives \((x_{1,i}-x^{*}_{i})^{2}\le \Vert x_{1}-x^{*}\Vert _{\infty }^{2}\le D_{\infty }^{2}\). It then follows from the update of \(\hat{v}_{t}\) in Algorithm 2, i.e., \(\hat{v}_{t}=\max \{\hat{v}_{t-1},v_{t}\}\), that \(\hat{v}_{t}\ge \hat{v}_{t-1}\). Then, since \(\widetilde{v}_{t}=\hat{v}_{t}+\epsilon \), we also have \(\widetilde{v}_{t}\ge \widetilde{v}_{t-1}\). Thus, from the stepsize condition \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\), we have
Thus, using the condition that \(\beta _{1t}\le \beta _{1}\), we can further estimate (A8) as follows.
where the second inequality is due to \((x_{t,i}-x^{*}_{i})^{2}\le \Vert x_{t}-x^{*}\Vert _{\infty }^{2}\le D_{\infty }^{2}\) assumed in Assumption 1, and the last equality holds by the stepsize condition that \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\).
By the definition of \(\Lambda _{t}\) in (A6), we have
where the first inequality holds by the conditions that \(\beta _{1t}<1\) and \(\beta _{1t}\le \beta _{1}\), and the second inequality follows from \(\widetilde{v}_{t,i}=\hat{v}_{t,i}+\epsilon \ge \epsilon \), i.e., \(\widetilde{v}_{t,i}^{-1/2}\le \frac{1}{\sqrt{\epsilon }}\) for all \(t\in \left[ T \right] \). The third inequality is due to \(\alpha _{t}=\frac{\alpha }{\sqrt{t}}\), so \(\alpha _{t}\le \alpha _{t-1}\), and the last inequality holds because \(m_{0}=0\), i.e., \(\sum _{t=1}^{T}\alpha _{t-1}m_{t-1,i}^{2}=\sum _{t=1}^{T-1}\alpha _{t}m_{t,i}^{2}\le \sum _{t=1}^{T}\alpha _{t}m_{t,i}^{2}\). Next, from the iterate of \(m_{t}\) in Algorithm 2, we arrive at
where the first inequality follows from \(1-\beta _{1k}\le 1\) and \(\beta _{1j}\le \beta _{1}\) for any \(j\ge 1\), the second inequality holds by the Jensen inequality with respect to the convex function \(x^{2}\), and the last inequality is due to the fact that \(\beta _{1}<1\), thus \(\sum _{k=0}^{\infty }\beta _{1}^{k}\le \frac{1}{1-\beta _{1}}\). Now, plugging (A12) in (A11) gives
where we exchange the order of summation in the first equality, and the second inequality follows from the fact that \(\sqrt{k}\le \sqrt{t}\) for any \(t\ge k\). The last two inequalities hold by \(\beta _{1}\ge 0\), so \(\sum _{t=k}^{T}\beta _{1}^{t}\le \sum _{t=k}^{\infty }\beta _{1}^{t}\le \frac{\beta _{1}^{k}}{1-\beta _{1}}\). Further, applying the bounded gradient condition in Assumption 2, i.e., \(\Vert g_{t}\Vert \le G\), we obtain
where the last inequality follows from
Combining (A14) and (A13), we have
From the definition of \(\Upsilon _{t}\) in (A6), we have
where the first inequality holds by the condition that \(\beta _{1t}\le \beta _{1}\) and the last inequality is due to Assumption 1.
Finally, substituting the bounds on \(\Theta _{t}\) in (A10), \(\Lambda _{t}\) in (A16) and \(\Upsilon _{t}\) in (A17) into (A7), we arrive at
Next, since \({{v}_{t}}=\frac{{{v}_{t-1}}+\left( 1-{{\beta }_{2}} \right) \left| g_{t}^{2}-{{v}_{t-1}} \right| }{{{\left\| {{v}_{t-1}}+\left( 1-{{\beta }_{2}} \right) \left| g_{t}^{2}-{{v}_{t-1}} \right| \right\| }_{p}}}\), we have \(\Vert v_{t}\Vert =1\). By the iterate of \(\hat{v}_{t}\), i.e., \(\hat{v}_{t}=\max \{\hat{v}_{t-1},v_{t}\}\) with \(\hat{v}_{0}=0\), we obtain \(\Vert \hat{v}_{t}\Vert =1\). By the definition of \(\widetilde{v}_{t}\), we have
Combining (A18) and (A19) gives
Applying the convexity of the function \(f_{t}\), we know that
This completes the proof. \(\square \)
Appendix B Proof of corollary 1
Proof of Corollary 1
By the conditions that \({{\beta }_{1t}}={{\beta }_{1}}{{\lambda }^{t}}\) and \(0<\lambda <1\), we have
From Theorem 1, it is clear that
This completes the proof. \(\square \)
Appendix C Proof of corollary 2
Proof of Corollary 2
Since \({{\beta }_{1t}}={{\beta }_{1}}/t\) and \(\alpha _{t}=\alpha /\sqrt{t}\), we then have
where the last inequality is due to
From Theorem 1, it is clear that
This completes the proof. \(\square \)
Appendix D Algorithms
This section provides the algorithms for the different optimization methods, including AdaGrad, AdaGradN, AdaBound, ABNBound, AdamW, ABNAdamW, RAdam and ABNRAdam.
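To make the pattern concrete, the absolute-value and normalization modifications can be sketched on an AdaGrad-style accumulator roughly as follows. This is a hypothetical AdaGradN-style step for illustration only; the function name and default hyperparameters are assumptions, not the paper's exact pseudocode.

```python
import numpy as np

def adagradn_step(x, v, grad, alpha=1e-2, beta2=0.999, eps=1e-8, p=2):
    """A sketch of the ABN modifications on an AdaGrad-style accumulator.

    Instead of accumulating squared gradients without bound, the accumulator
    moves toward |g^2 - v| and is renormalized to unit p-norm each step.
    """
    raw = v + (1 - beta2) * np.abs(grad**2 - v)   # absolute-value iterate
    v = raw / np.linalg.norm(raw, ord=p)          # normalize the accumulator
    x = x - alpha * grad / (np.sqrt(v) + eps)     # AdaGrad-style update
    return x, v
```

The same substitution of the second-moment iterate applies to the momentum-based variants (ABNBound, ABNAdamW, ABNRAdam), which keep their respective step-size schedules and bias corrections.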
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jiang, W., Liang, Y., Jiang, Z. et al. ABNGrad: adaptive step size gradient descent for optimizing neural networks. Appl Intell 54, 2361–2378 (2024). https://doi.org/10.1007/s10489-024-05303-6
DOI: https://doi.org/10.1007/s10489-024-05303-6