Abstract
Adaptive (sub)gradient methods are widely used in applications such as the training of deep networks, and they achieve square-root regret bounds in general convex settings. However, how to exploit strong convexity to improve both the convergence rate and the generalization performance of adaptive (sub)gradient methods remains an open problem. To this end, we devise an adaptive subgradient online learning algorithm, called SAdaBoundNc, for the strongly convex setting. We rigorously prove that it achieves a logarithmic regret bound by choosing a faster diminishing learning rate. Further, we conduct extensive experiments to evaluate the performance of SAdaBoundNc on four real-world datasets. The results demonstrate that SAdaBoundNc trains faster than stochastic gradient descent and several adaptive gradient methods in the early stages of training. Moreover, the generalization performance of SAdaBoundNc is also better than that of current state-of-the-art methods on different datasets.
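To make the setting concrete, the following minimal NumPy sketch shows the general shape of a bounded adaptive (sub)gradient update with an O(1/k) step-size schedule of the kind the abstract refers to. It is only an AdaBound-style illustration under assumed bound functions and hyperparameter names (alpha_star, gamma, lam1, lam2); it is not the exact SAdaBoundNc update given by the paper's Eqs. (5)–(9), and it omits the weighted projection onto the feasible set used in the analysis.

import numpy as np

def clipped_adaptive_step(w, g, m, v, k,
                          alpha_star=0.1, gamma=1e-3,
                          lam1=0.9, lam2=0.999, eps=1e-8):
    """One illustrative AdaBound-style update with an O(1/k) schedule.

    w, g : current iterate and (sub)gradient, both of shape (d,)
    m, v : first- and second-moment accumulators, both of shape (d,)
    k    : 1-based iteration counter
    All hyperparameter names are illustrative, not the paper's notation.
    """
    # Exponential moving averages of the gradient and its element-wise square.
    m = lam1 * m + (1.0 - lam1) * g
    v = lam2 * v + (1.0 - lam2) * g * g

    # Dynamic lower/upper bounds that both converge to alpha_star as k grows
    # (AdaBound-style clipping of the per-coordinate step size).
    lower = alpha_star * (1.0 - 1.0 / (gamma * k + 1.0))
    upper = alpha_star * (1.0 + 1.0 / (gamma * k))

    # Clip the adaptive step size, then apply a 1/k decay (the faster
    # diminishing schedule that strong convexity permits). The paper's
    # weighted projection onto the feasible set is omitted here.
    step = np.clip(1.0 / (np.sqrt(v) + eps), lower, upper) / k

    return w - step * m, m, v

In a stochastic training loop one would call clipped_adaptive_step once per minibatch gradient; in the strongly convex online analysis, the 1/k decay (rather than 1/sqrt(k)) is what permits a logarithmic rather than square-root regret bound.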
References
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Physica-Verlag, Heidelberg, pp 177–186
Haykin S (2014) Adaptive filter theory, 5th edn. Pearson Education, London
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313:504–507
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159
Tieleman T, Hinton G (2012) Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw Mach Learn 4:26–31
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, Conference Track Proceedings, 2015. arxiv:1412.6980
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: Proceedings of the 31st international conference on neural information processing systems, Curran Associates, Inc., pp 4148–4158
Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond. In: Proceedings of the 6th international conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings, www.OpenReview.net. https://openreview.net/forum?id=ryQu7f-RZ
Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. In: Proceedings of the 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, www.OpenReview.net. https://openreview.net/forum?id=Bkg3g2R9FX
Chen J, Zhou D, Tang Y, Yang Z, Cao Y, Gu Q (2020) Closing the generalization gap of adaptive gradient methods in training deep neural networks. In: Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020, www.ijcai.org, pp 3267–3275. https://doi.org/10.24963/ijcai.2020/452
Hazan E (2016) Introduction to online convex optimization. Found Trends Optim 2:157–325
Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Mach Learn 69:169–192
Mukkamala MC, Hein M (2017) Variants of rmsprop and adagrad with logarithmic regret bounds. In: Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, vol 70 of Proceedings of Machine Learning Research, PMLR, pp 2545–2553. http://proceedings.mlr.press/v70/mukkamala17a.html
Wang G, Lu S, Cheng Q, Tu W, Zhang L (2020) Sadam: a variant of adam for strongly convex functions. In: Proceedings of the 8th international conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, www.OpenReview.net, 2020. https://openreview.net/forum?id=rye5YaEtPr
Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the Twentieth International conference on machine learning (ICML 2003), August 21–24, Washington, DC, USA, AAAI Press, 2003, pp 928–936. http://www.aaai.org/Library/ICML/2003/icml03-120.php
Cesa-Bianchi N, Conconi A, Gentile C (2004) On the generalization ability of on-line learning algorithms. IEEE Trans Inf Theory 50:2050–2057
Zeiler MD (2012) Adadelta: an adaptive learning rate method. CoRR arxiv:1212.5701
Dozat T (2016) Incorporating Nesterov momentum into Adam. In: Proceedings of the 4th international conference on learning representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Workshop Track Proceedings. https://openreview.net/forum?id=OM0jvwB8jIp57Z-JjtNEZ
Gregor K, Danihelka I, Graves A, Rezende DJ, Wierstra D (2015) DRAW: a recurrent neural network for image generation. In: Proceedings of the 32nd international conference on machine learning, ICML 2015, Lille, France, 6–11 July 2015, vol 37 of JMLR workshop and conference proceedings, www.JMLR.org, pp 1462–1471. http://proceedings.mlr.press/v37/gregor15.html
Xu K, Ba J, Kiros R, Cho K, Courville AC, Salakhutdinov R, Zemel RS, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd international conference on machine learning, ICML 2015, Lille, France, 6–11 July 2015, vol 37 of JMLR workshop and conference proceedings, www.JMLR.org, pp 2048–2057. http://proceedings.mlr.press/v37/xuc15.html
Choi D, Shallue CJ, Nado Z, Lee J, Maddison CJ, Dahl GE (2019) On empirical comparisons of optimizers for deep learning. CoRR arxiv:1910.05446
Keskar NS, Socher R (2017) Improving generalization performance by switching from Adam to SGD. CoRR arxiv:1712.07628
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Technical Report, Citeseer
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, IEEE Computer Society, 2016, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
Chen Z, Xu Y, Chen E, Yang T (2018) SADAGRAD: strongly adaptive stochastic gradient method. In: Dy JG, Krause A (Eds.), Proceedings of the 35th international conference on machine learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, vol 80 of proceedings of machine learning research, PMLR, 2018, pp 912–920. http://proceedings.mlr.press/v80/chen18m.html
McMahan HB, Streeter MJ (2010) Adaptive bound optimization for online convex optimization. In: Proceedings of the 23rd conference on learning theory (COLT 2010), Haifa, Israel, June 27–29, 2010, Omnipress, pp 244–256. http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf#page=252
Marcus M, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: The Penn Treebank. Comput Linguist 19:313–330
Everingham M, Gool LV, Williams CKI, Winn JM, Zisserman A (2009) The Pascal Visual Object Classes (VOC) challenge. Int J Comput Vision 88:303–338
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 62002102, No. 72002133, and No. 62172142, in part by the Leading Talents of Science and Technology in the Central Plain of China under Grant No. 224200510004, in part by the Ministry of Education of China Science Foundation under Grant No. 19YJC630174, and in part by the Program for Science & Technology Innovation Talents in the University of Henan Province under Grant No. 22HASTIT014.
Author information
Authors and Affiliations
Contributions
LW contributed to methodology and theoretical analysis. XW contributed to validation and software. RZ contributed to conceptualization and writing—review and editing. JZ contributed to resources and investigation. MZ contributed to supervision, project administration, and funding acquisition.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
Proof of Theorem 1
From Eqs. (1) and (8), we have
Moreover, \(P_{{\mathcal {K}},\Delta _k}\left( w^{*}\right) =w^{*}\) for all \(w^{*}\in {\mathcal {K}}\). Using the non-expansiveness property of the weighted projection [28], we have
Rearranging Eq. (12), we obtain
Further, by using Cauchy–Schwarz and Young’s inequality, the term \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle \) in Eq. (13) can be upper bounded as follows:
When \(k=1\), \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle =0\) due to \(m_0=0\); when \(k\ge 2\), we have
Plugging Eq. (14) into Eq. (13), we get
We now derive the regret bound of SAdaBoundNc. By utilizing the strong convexity of \(f_k\) for all \(k\in [K]\), we have
Plugging Eq. (15) into Eq. (16), we obtain
To obtain the upper bound of Eq. (17), we first derive the following relation:
where the last inequality is obtained due to \(\lambda _{11}=\lambda _1\) and \(\lambda _{1k}\le \lambda _{1(k-1)}\) for all \(k\in [K]\). By using Eq. (18), we further get
where
and
To bound Eq. (19), \(T_{11}\) and \(T_{12}\) need to be bounded. Because Condition 1 holds, we have
where the last inequality follows from Assumption 1. For the term \(T_{12}\), we derive the following relation:
where we have utilized Eq. (7) to derive the first inequality; the second inequality is obtained by utilizing Eq. (9); the last inequality follows from \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\). Thus, we immediately get
Plugging Eqs. (20) and (22) into Eq. (19), we have
We now bound the term \(T_2\) in Eq. (17). The definition of \(\delta _k\) implies that
Thus, we obtain
where the first inequality holds because \(\lambda _{1k}\le \lambda _1<1\); we use Eq. (23) to derive the last inequality. To bound \(T_2\), we need to upper bound the term \(\sum \nolimits _{k=1}^K\left\| m_k\right\| ^2/k\) in Eq. (24). Hence, by using the definition of \(m_k\) in Eq. (5), we have
To further bound Eq. (26), we first bound the term \(T_{21}\) by using the Cauchy–Schwarz inequality as follows:
where the first inequality holds by utilizing \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\); the last inequality is derived from the inequality \(\sum \nolimits _{j=1}^K\lambda _1^{K-j}\le 1/(1-\lambda _1)\). Plugging Eq. (27) into Eq. (26), we get
where the third inequality is due to Assumption 2; the fifth inequality is obtained by the inequality \(\sum \nolimits _{j=k}^K\lambda _1^{j-k} \le 1/(1-\lambda _1)\); the sixth inequality is derived by using the following inequality:
Hence, plugging Eq. (28) into Eq. (25), we have
Finally, we bound the term \(T_3\) as follows:
where the last inequality follows from Assumption 1 and \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\). Plugging Eqs. (23), (30), and (31) into Eq. (17) completes the proof of Theorem 1. \(\Box \)
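For ease of reference, the elementary facts invoked in the displayed steps above are the following standard inequalities, stated here in generic form (they are not reproductions of the paper's numbered equations). By the Cauchy–Schwarz inequality followed by Young's inequality, for any \(\beta >0\),
\[ \left\langle m_{k-1},w^{*}-w_k\right\rangle \le \left\| m_{k-1}\right\| \left\| w^{*}-w_k\right\| \le \frac{1}{2\beta }\left\| m_{k-1}\right\| ^2+\frac{\beta }{2}\left\| w^{*}-w_k\right\| ^2. \]
By the \(\mu \)-strong convexity of \(f_k\), for any subgradient \(g_k\) of \(f_k\) at \(w_k\),
\[ f_k(w_k)-f_k(w^{*})\le \left\langle g_k,w_k-w^{*}\right\rangle -\frac{\mu }{2}\left\| w_k-w^{*}\right\| ^2. \]
Finally, for \(0<\lambda _1<1\), the geometric-series bounds
\[ \sum \nolimits _{j=1}^K\lambda _1^{K-j}\le \frac{1}{1-\lambda _1},\qquad \sum \nolimits _{j=k}^K\lambda _1^{j-k}\le \frac{1}{1-\lambda _1} \]
are the ones used when controlling the momentum terms.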
Appendix B
Proof of Corollary 1
By the definitions of bounded functions \({\underline{\alpha }}(k)\) and \({\overline{\alpha }}(k)\), we get
where we have used the relation \(1/{\overline{\alpha }}(k)\le 1/\alpha ^{\star }\) for all \(k\in [K]\) to derive the first inequality; in the last inequality, we have utilized the inequality \(\frac{1+\gamma k}{1+\gamma (k-1)}\le 1+\gamma \), which is due to \(\gamma ^2(k-1)\ge 0\) for all \(\gamma >0\) and \(k\ge 1\). Thus, for any \(\alpha ^{\star }\ge \frac{3+2\gamma ^{-1}}{\mu (1-\lambda _1)}\), the following inequality holds:
Besides, by the definition of \(\delta _k\), for all \(i\in [d]\) and \(k\in [K]\), we get
Because \(\lambda _{1k}=\lambda _1\nu ^{k-1}\), we have
where the first inequality follows from Eq. (33); the last inequality is derived from the following inequality:
where \(\partial _{\nu }\) denotes the derivative with respect to \(\nu \). In addition, Eq. (33) implies that \(\delta _{1,i}^{-1}\le \frac{1+\gamma ^{-1}}{\alpha ^{\star }}\) for all \(i\in [d]\). Thus, plugging Eq. (34) into Eq. (10), we have
Furthermore, since \(\alpha _{\max }={\overline{\alpha }}(1)=\alpha ^{\star }(1+\gamma ^{-1})\), Corollary 1 is proved. \(\Box \)
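The two elementary facts used in this proof can be checked directly; they are stated here generically and are not reproductions of the omitted displays. First, for any \(\gamma >0\) and \(k\ge 1\),
\[ (1+\gamma )\left( 1+\gamma (k-1)\right) =1+\gamma k+\gamma ^2(k-1)\ge 1+\gamma k, \]
which gives \(\frac{1+\gamma k}{1+\gamma (k-1)}\le 1+\gamma \). Second, assuming the omitted display is the standard derivative-of-a-geometric-series bound, for \(0<\nu <1\),
\[ \sum \nolimits _{k=1}^Kk\nu ^{k-1}\le \sum \nolimits _{k=1}^{\infty }k\nu ^{k-1}=\partial _{\nu }\left( \sum \nolimits _{k=1}^{\infty }\nu ^k\right) =\partial _{\nu }\left( \frac{\nu }{1-\nu }\right) =\frac{1}{(1-\nu )^2}, \]
where the first inequality holds because every summand is nonnegative.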
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, L., Wang, X., Li, T. et al. SAdaBoundNc: an adaptive subgradient online learning algorithm with logarithmic regret bounds. Neural Comput & Applic 35, 8051–8063 (2023). https://doi.org/10.1007/s00521-022-08082-8