Abstract
Adaptive (sub)gradient methods are widely used in applications such as the training of deep networks, and they achieve square-root regret bounds in general convex settings. However, how to exploit strong convexity to improve both the convergence rate and the generalization performance of adaptive (sub)gradient methods remains an open problem. To this end, we devise an adaptive subgradient online learning algorithm, called SAdaBoundNc, for the strongly convex setting. We rigorously prove that it achieves a logarithmic regret bound by choosing a faster diminishing learning rate. Further, we conduct extensive experiments to evaluate the performance of SAdaBoundNc on four real-world datasets. The results demonstrate that SAdaBoundNc trains faster than stochastic gradient descent and several adaptive gradient methods in the early stages of training. Moreover, the generalization performance of SAdaBoundNc is also better than that of current state-of-the-art methods on different datasets.
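To make the setting concrete, the following minimal NumPy sketch shows the general shape of a bounded adaptive (sub)gradient update with an O(1/k) step-size schedule of the kind the abstract refers to. It is only an AdaBound-style illustration under assumed bound functions and hyperparameter names (alpha_star, gamma, lam1, lam2); it is not the exact SAdaBoundNc update given by the paper's Eqs. (5)–(9), and it omits the weighted projection onto the feasible set used in the analysis.

import numpy as np

def clipped_adaptive_step(w, g, m, v, k,
                          alpha_star=0.1, gamma=1e-3,
                          lam1=0.9, lam2=0.999, eps=1e-8):
    """One illustrative AdaBound-style update with an O(1/k) schedule.

    w, g : current iterate and (sub)gradient, both of shape (d,)
    m, v : first- and second-moment accumulators, both of shape (d,)
    k    : 1-based iteration counter
    All hyperparameter names are illustrative, not the paper's notation.
    """
    # Exponential moving averages of the gradient and its element-wise square.
    m = lam1 * m + (1.0 - lam1) * g
    v = lam2 * v + (1.0 - lam2) * g * g

    # Dynamic lower/upper bounds that both converge to alpha_star as k grows
    # (AdaBound-style clipping of the per-coordinate step size).
    lower = alpha_star * (1.0 - 1.0 / (gamma * k + 1.0))
    upper = alpha_star * (1.0 + 1.0 / (gamma * k))

    # Clip the adaptive step size, then apply a 1/k decay (the faster
    # diminishing schedule that strong convexity permits). The paper's
    # weighted projection onto the feasible set is omitted here.
    step = np.clip(1.0 / (np.sqrt(v) + eps), lower, upper) / k

    return w - step * m, m, v

In a stochastic training loop one would call clipped_adaptive_step once per minibatch gradient; in the strongly convex online analysis, the 1/k decay (rather than 1/sqrt(k)) is what permits a logarithmic rather than square-root regret bound.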
References
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Physica-Verlag, Heidelberg, pp 177–186
Haykin S (2014) Adaptive filter theory, 5th edn. Pearson Education, London
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313:504–507
Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159
Tieleman T, Hinton G (2012) Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw Mach Learn 4:26–31
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, Conference Track Proceedings, 2015. arxiv:1412.6980
Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: Proceedings of the 31st international conference on neural information processing systems, Curran Associates, Inc., pp 4148–4158
Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond. In: Proceedings of the 6th international conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings, www.OpenReview.net. https://openreview.net/forum?id=ryQu7f-RZ
Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. In: Proceedings of the 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, www.OpenReview.net. https://openreview.net/forum?id=Bkg3g2R9FX
Chen J, Zhou D, Tang Y, Yang Z, Cao Y, Gu Q (2020) Closing the generalization gap of adaptive gradient methods in training deep neural networks. In: Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020, www.ijcai.org, pp 3267–3275. https://doi.org/10.24963/ijcai.2020/452
Hazan E (2016) Introduction to online convex optimization. Found Trends Optim 2:157–325
Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Mach Learn 69:169–192
Mukkamala MC, Hein M (2017) Variants of rmsprop and adagrad with logarithmic regret bounds. In: Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, vol 70 of Proceedings of Machine Learning Research, PMLR, pp 2545–2553. http://proceedings.mlr.press/v70/mukkamala17a.html
Wang G, Lu S, Cheng Q, Tu W, Zhang L (2020) Sadam: a variant of adam for strongly convex functions. In: Proceedings of the 8th international conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, www.OpenReview.net, 2020. https://openreview.net/forum?id=rye5YaEtPr
Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the Twentieth International conference on machine learning (ICML 2003), August 21–24, Washington, DC, USA, AAAI Press, 2003, pp 928–936. http://www.aaai.org/Library/ICML/2003/icml03-120.php
Cesa-Bianchi N, Conconi A, Gentile C (2004) On the generalization ability of on-line learning algorithms. IEEE Trans Inf Theory 50:2050–2057
Zeiler MD (2012) Adadelta: an adaptive learning rate method. CoRR arxiv:1212.5701
Dozat T (2016) Incorporating Nesterov momentum into Adam. In: Proceedings of the 4th international conference on learning representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, 2016, Workshop Track Proceedings. https://openreview.net/forum?id=OM0jvwB8jIp57Z-JjtNEZ
Gregor K, Danihelka I, Graves A, Rezende DJ, Wierstra D (2015) DRAW: a recurrent neural network for image generation. In: Proceedings of the 32nd international conference on machine learning, ICML 2015, Lille, France, 6–11 July 2015, vol 37 of JMLR workshop and conference proceedings, www.JMLR.org, pp 1462–1471. http://proceedings.mlr.press/v37/gregor15.html
Xu K, Ba J, Kiros R, Cho K, Courville AC, Salakhutdinov R, Zemel RS, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd international conference on machine learning, ICML 2015, Lille, France, 6–11 July 2015, vol 37 of JMLR workshop and conference proceedings, www.JMLR.org, pp 2048–2057. http://proceedings.mlr.press/v37/xuc15.html
Choi D, Shallue CJ, Nado Z, Lee J, Maddison CJ, Dahl GE (2019) On empirical comparisons of optimizers for deep learning. CoRR arxiv:1910.05446
Keskar NS, Socher R (2017) Improving generalization performance by switching from Adam to SGD. CoRR arxiv:1712.07628
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Technical Report, Citeseer
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, IEEE Computer Society, 2016, pp. 770–778. https://doi.org/10.1109/CVPR.2016.90
Chen Z, Xu Y, Chen E, Yang T (2018) SADAGRAD: strongly adaptive stochastic gradient method. In: Dy JG, Krause A (Eds.), Proceedings of the 35th international conference on machine learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, vol 80 of proceedings of machine learning research, PMLR, 2018, pp 912–920. http://proceedings.mlr.press/v80/chen18m.html
McMahan HB, Streeter MJ (2010) Adaptive bound optimization for online convex optimization. In: Proceedings of the 23rd conference on learning theory (COLT 2010), Haifa, Israel, June 27–29, 2010, Omnipress, pp 244–256. http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf#page=252
Marcus M, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: The Penn Treebank. Comput Linguist 19:313–330
Everingham M, Gool LV, Williams CKI, Winn JM, Zisserman A (2009) The Pascal Visual Object Classes (VOC) challenge. Int J Comput Vision 88:303–338
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 62002102, No. 72002133, and No. 62172142, in part by the Leading Talents of Science and Technology in the Central Plain of China under Grant No. 224200510004, in part by the Ministry of Education of China Science Foundation under Grant No. 19YJC630174, and in part by the Program for Science & Technology Innovation Talents in the University of Henan Province under Grant No. 22HASTIT014.
Author information
Authors and Affiliations
Contributions
LW contributed to methodology and theoretical analysis. XW contributed to validation and software. RZ contributed to conceptualization and writing—review and editing. JZ contributed to resources and investigation. MZ contributed to supervision, project administration, and funding acquisition.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A
Proof of Theorem 1
From Eqs. (1) and (8), we have
Moreover, \(P_{{\mathcal {K}},\Delta _k}\left( w^{*}\right) =w^{*}\) for all \(w^{*}\in {\mathcal {K}}\). Using the non-expansiveness property of the weighted projection [28], we have
Rearranging Eq. (12), we obtain
Further, by using Cauchy–Schwarz and Young’s inequality, the term \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle \) in Eq. (13) can be upper bounded as follows:
When \(k=1\), \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle =0\) due to \(m_0=0\); when \(k\ge 2\), we have
Plugging Eq. (14) into Eq. (13), we get
We now derive the regret bound of SAdaBoundNc. By utilizing the strong convexity of \(f_k\) for all \(k\in [K]\), we have
Plugging Eq. (15) into Eq. (16), we obtain
To obtain the upper bound of Eq. (17), we first derive the following relation:
where the last inequality is obtained due to \(\lambda _{11}=\lambda _1\) and \(\lambda _{1k}\le \lambda _{1(k-1)}\) for all \(k\in [K]\). By using Eq. (18), we further get
where
and
To bound Eq. (19), \(T_{11}\) and \(T_{12}\) need to be bounded. Because Condition 1 holds, we have
where the last inequality follows from Assumption 1. For the term \(T_{12}\), we derive the following relation:
where we have utilized Eq. (7) to derive the first inequality; the second inequality is obtained by utilizing Eq. (9); the last inequality follows from \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\). Thus, we immediately get
Plugging Eqs. (20) and (22) into Eq. (19), we have
We now bound the term \(T_2\) in Eq. (17). The definition of \(\delta _k\) implies that
Thus, we obtain
where the first inequality holds because \(\lambda _{1k}\le \lambda _1<1\); we use Eq. (23) to derive the last inequality. To bound \(T_2\), we need to upper bound the term \(\sum \nolimits _{k=1}^K\left\| m_k\right\| ^2/k\) in Eq. (24). Hence, by using the definition of \(m_k\) in Eq. (5), we have
To further bound Eq. (26), we first bound the term \(T_{21}\) by using the Cauchy–Schwarz inequality as follows:
where the first inequality holds by utilizing \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\); the last inequality is derived from the inequality \(\sum \nolimits _{j=1}^K\lambda _1^{K-j}\le 1/(1-\lambda _1)\). Plugging Eq. (27) into Eq. (26), we get
where the third inequality is due to Assumption 2; the fifth inequality is obtained by the inequality \(\sum \nolimits _{j=k}^K\lambda _1^{j-k} \le 1/(1-\lambda _1)\); the sixth inequality is derived by using the following inequality:
Hence, plugging Eq. (28) into Eq. (25), we have
Finally, we bound the term \(T_3\) as follows:
where the last inequality follows from Assumption 1 and \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\). Plugging Eqs. (23), (30), and (31) into Eq. (17) completes the proof of Theorem 1. \(\Box \)
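For ease of reference, the elementary facts invoked in the displayed steps above are the following standard inequalities, stated here in generic form (they are not reproductions of the paper's numbered equations). By the Cauchy–Schwarz inequality followed by Young's inequality, for any \(\beta >0\),
\[ \left\langle m_{k-1},w^{*}-w_k\right\rangle \le \left\| m_{k-1}\right\| \left\| w^{*}-w_k\right\| \le \frac{1}{2\beta }\left\| m_{k-1}\right\| ^2+\frac{\beta }{2}\left\| w^{*}-w_k\right\| ^2. \]
By the \(\mu \)-strong convexity of \(f_k\), for any subgradient \(g_k\) of \(f_k\) at \(w_k\),
\[ f_k(w_k)-f_k(w^{*})\le \left\langle g_k,w_k-w^{*}\right\rangle -\frac{\mu }{2}\left\| w_k-w^{*}\right\| ^2. \]
Finally, for \(0<\lambda _1<1\), the geometric-series bounds
\[ \sum \nolimits _{j=1}^K\lambda _1^{K-j}\le \frac{1}{1-\lambda _1},\qquad \sum \nolimits _{j=k}^K\lambda _1^{j-k}\le \frac{1}{1-\lambda _1} \]
are the ones used when controlling the momentum terms.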
Appendix B
Proof of Corollary 1
By the definitions of bounded functions \({\underline{\alpha }}(k)\) and \({\overline{\alpha }}(k)\), we get
where we have used the relation \(1/{\overline{\alpha }}(k)\le 1/\alpha ^{\star }\) for all \(k\in [K]\) to derive the first inequality; in the last inequality, we have utilized the inequality \(\frac{1+\gamma k}{1+\gamma (k-1)}\le 1+\gamma \), which is due to \(\gamma ^2(k-1)\ge 0\) for all \(\gamma >0\) and \(k\ge 1\). Thus, for any \(\alpha ^{\star }\ge \frac{3+2\gamma ^{-1}}{\mu (1-\lambda _1)}\), the following inequality holds:
Besides, by the definition of \(\delta _k\), for all \(i\in [d]\) and \(k\in [K]\), we get
Because \(\lambda _{1k}=\lambda _1\nu ^{k-1}\), we have
where the first inequality follows from Eq. (33); the last inequality is derived from the following inequality:
where \(\partial _{\nu }\) denotes the derivative with respect to \(\nu \). In addition, Eq. (33) implies that \(\delta _{1,i}^{-1}\le \frac{1+\gamma ^{-1}}{\alpha ^{\star }}\) for all \(i\in [d]\). Thus, plugging Eq. (34) into Eq. (10), we have
Furthermore, since \(\alpha _{\max }={\overline{\alpha }}(1)=\alpha ^{\star }(1+\gamma ^{-1})\), Corollary 1 is proved. \(\Box \)
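The two elementary facts used in this proof can be checked directly; they are stated here generically and are not reproductions of the omitted displays. First, for any \(\gamma >0\) and \(k\ge 1\),
\[ (1+\gamma )\left( 1+\gamma (k-1)\right) =1+\gamma k+\gamma ^2(k-1)\ge 1+\gamma k, \]
which gives \(\frac{1+\gamma k}{1+\gamma (k-1)}\le 1+\gamma \). Second, assuming the omitted display is the standard derivative-of-a-geometric-series bound, for \(0<\nu <1\),
\[ \sum \nolimits _{k=1}^Kk\nu ^{k-1}\le \sum \nolimits _{k=1}^{\infty }k\nu ^{k-1}=\partial _{\nu }\left( \sum \nolimits _{k=1}^{\infty }\nu ^k\right) =\partial _{\nu }\left( \frac{\nu }{1-\nu }\right) =\frac{1}{(1-\nu )^2}, \]
where the first inequality holds because every summand is nonnegative.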
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, L., Wang, X., Li, T. et al. SAdaBoundNc: an adaptive subgradient online learning algorithm with logarithmic regret bounds. Neural Comput & Applic 35, 8051–8063 (2023). https://doi.org/10.1007/s00521-022-08082-8