
LightAdam: Towards a Fast and Accurate Adaptive Momentum Online Algorithm


Abstract

Adaptive optimization algorithms enjoy fast convergence and have been widely exploited in pattern recognition and cognitively inspired machine learning. However, these algorithms may suffer from high computational cost and low generalization ability due to their projection steps. Such limitations make them difficult to apply to big data analytics, which is typical of cognitively inspired learning, e.g., deep learning tasks. In this paper, we propose a fast and accurate adaptive momentum online algorithm, called LightAdam, to alleviate the drawbacks of the projection steps in adaptive algorithms. The proposed algorithm substantially reduces the computational cost of each iteration by replacing high-order projection operators with one-dimensional linear searches. Moreover, we introduce a novel second-order momentum and employ dynamic learning rate bounds in the proposed algorithm, thereby obtaining higher generalization ability than other adaptive algorithms. We theoretically show that the proposed algorithm has a guaranteed convergence bound and prove that it has better generalization capability than Adam. We conduct extensive experiments on three public datasets for image pattern classification, and validate the computational benefit and accuracy of the proposed algorithm in comparison with other state-of-the-art adaptive optimization algorithms.
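As a rough illustration of the update pattern described above, the following Python sketch combines Adam-style first- and second-order momentum with a projection-free, Frank-Wolfe-style linear step over a feasible set. The function names, the L2-ball feasible set, and all constants are illustrative assumptions and not the paper's exact algorithm or its dynamic learning-rate bounds.

```python
import numpy as np

def linear_oracle_l2_ball(direction, radius=1.0):
    """Linear minimization oracle over an L2 ball (assumed feasible set):
    argmin_{||s|| <= radius} <direction, s>."""
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return np.zeros_like(direction)
    return -radius * direction / norm

def adaptive_frank_wolfe_step(x, grad, m, v, t,
                              beta1=0.9, beta2=0.999, eps=1e-8, c=1.0):
    """One illustrative projection-free step with adaptive momentum.
    Hypothetical sketch: the exact moment estimates and dynamic learning-rate
    bounds of LightAdam are not reproduced here."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-order momentum
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-order momentum
    direction = m / (np.sqrt(v) + eps)            # adaptive search direction
    s = linear_oracle_l2_ball(direction)          # 1-D linear search replaces projection
    xi = min(1.0, c / np.sqrt(t))                 # decaying convex-combination weight
    return (1.0 - xi) * x + xi * s, m, v          # convex combination stays feasible
```

The convex combination with weight \(\xi_t\) keeps the iterate feasible without ever computing a projection, which is the source of the per-iteration savings highlighted above.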


References

  1. McMahan HB, Streeter MJ. Adaptive bound optimization for online convex optimization, in: The 23rd Conference on Learning Theory. 2010:244–256.

  2. Sutskever I, Martens J, Dahl GE, Hinton GE. On the importance of initialization and momentum in deep learning, in: Proceedings of the 30th International Conference on Machine Learning. 2013:1139–1147.

  3. Long M, Cao Y, Cao Z, Wang J, Jordan M. Transferable representation learning with deep adaptation networks. IEEE Trans Pattern Anal Mach Intell. 2019;41:3071–85.


  4. Yang X, Huang K, Zhang R, et al. A Novel Deep Density Model for Unsupervised Learning. Cogn Comput. 2019;11:778–88.


  5. Nguyen B, Morell C, Baets BD. Scalable large-margin distance metric learning using stochastic gradient descent. IEEE Transactions on Cybernetics. 2020;50:1072–83.


  6. Balcan M, Khodak M, Talwalkar A. Provable guarantees for gradient-based meta-learning, in: Proceedings of the 36th International Conference on Machine Learning, 2019:424–433.

  7. Nesterov Y. A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Doklady AN USSR. 1983;269:543–7.


  8. Tieleman T, Hinton G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, in: COURSERA: Neural Networks for Machine Learning. 2012.

  9. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12:2121–59.


  10. Ghadimi E, Feyzmahdavian HR, Johansson M. Global convergence of the heavy-ball method for convex optimization, in: Proceedings of The European Control Conference. 2015:310–315.

  11. Yang X, Zheng X, Gao H. SGD-Based Adaptive NN Control Design for Uncertain Nonlinear Systems. IEEE Transactions on Neural Networks and Learning Systems. 2018;29(10):5071–83.


  12. Peng Y, Hao Z, Yun X. Lock-free parallelization for variance-reduced stochastic gradient descent on streaming data. IEEE Trans Parallel Distrib Syst. 2020;31:2220–31.


  13. Perantonis SJ, Karras DA. An efficient constrained learning algorithm with momentum acceleration. Neural Netw. 1995;8:237–49.


  14. Kingma DP, Ba JL. Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations. 2015:1–15.

  15. Gu G, Dogandžić A. Projected nesterov’s proximal-gradient algorithm for sparse signal recovery. IEEE Trans Signal Process. 2017;65:3510–25.


  16. Chen J, Zhou D, Tang Y, et al. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks, in: Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence. 2020.

  17. Reddi SJ, Kale S, Kumar S. On the convergence of Adam and beyond, in: Proceedings of the Sixth International Conference on Learning Representations. 2018:1–23.

  18. Li W, Zhang Z, Wang X, Luo P. Adax: Adaptive gradient descent with exponential long term memory. 2020. https://arxiv.org/abs/2004.09740

  19. Luo L, Xiong Y, Liu Y, Sun X. Adaptive gradient methods with dynamic bound of learning rate, in: Proceedings of the Seventh International Conference on Learning Representations. 2019:1–19.

  20. Zhou Z, Zhang Q, Lu G, Wang H, Zhang W, Yu Y. Adashift: Decorrelation and convergence of adaptive learning rate methods. 2019:1–26.

  21. Hazan E, Kale S. Projection-free online learning, in: Proceedings of the 29th International Conference on Machine Learning, 2012:1–8.

  22. Balles L, Hennig P. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients, in: Proceedings of the 35th International Conference on Machine Learning, PMLR 80:404-413. 2018.

  23. Chen L, Harshaw C, Hassani H, Karbasi A. Projection-free online optimization with stochastic gradient: From convexity to submodularity, in: Proceedings of the 35th International Conference on Machine Learning. 2018:813–822.

  24. Hazan E, Minasyan E. Faster projection-free online learning, in: Proceedings of the 33rd Annual Conference on Learning Theory. 2020:1877–1893.

  25. Zhang M, Zhou Y, Quan W, Zhu J, Zheng R, Wu Q. Online learning for iot optimization: A frank-wolfe adam based algorithm. IEEE Internet Things J. 2020;7:8228–37.


  26. Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent, in: Proceedings of the Twentieth International Conference on Machine Learning. 2003:928–936.

  27. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770–778.

  28. Huang G, Liu Z, Maaten L, Weinberger KQ. Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:1–9.

  29. Berrada L, Zisserman A, Kumar MP. Deep Frank-Wolfe For Neural Network Optimization, in: Proceedings of the International Conference on Learning Representations. 2019.

  30. Lu H, Jin L, Luo X, et al. RNN for Solving Perturbed Time-Varying Underdetermined Linear System With Double Bound Limits on Residual Errors and State Variables. IEEE Trans Industr Inf. 2019;15(11):5931–42.


  31. Luo X, Zhou MC, Shang M, Xia Y. A Novel Approach to Extracting Non-Negative Latent Factors From Non-Negative Big Sparse Matrices. IEEE Access. 2016;4:2649–55.


  32. Luo X, Zhou MC, Li S, et al. Algorithms of Unconstrained Non-negative Latent Factor Analysis for Recommender Systems. IEEE Transactions on Big Data. 2021;7(1):227–40.



Funding

This work was partially supported by the Chinese Academy of Sciences under Grant No. Y9BEJ11001 and by the innovation workstation of the Suzhou Institute of Nano-Tech and Nano-Bionics (SINANO) under Grant No. E010210101. This work was also partially supported by the National Natural Science Foundation of China under Grant No. 61876155 and the Jiangsu Science and Technology Programme under Grant No. BE2020006-4.

Author information


Corresponding author

Correspondence to Xin Liu.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix of LightAdam

Proof of Lemma 1

Proof

From the definition of \(z_{t}\) and \(\mathbf {x}_{t+1}\), we obtain

$$\begin{aligned} z_t(\mathbf {x}_{t+1})&= Y_t(\mathbf {x}_{t+1}) - Y_t(\mathbf {x}_t^*) \nonumber \\&=Y_t\big ((1-\xi _t)\mathbf {x}_t+\xi _t\mathbf {s}_t\big )-Y_t(\mathbf {x}_t^*) \nonumber \\&=Y_t\big (\mathbf {x}_t+\xi _t(\mathbf {s}_t-\mathbf {x}_t)\big )-Y_t(\mathbf {x}_t^*) \end{aligned}$$
(31)

Based on the fact that \(Y_t(\mathbf {x})\) is 2-smooth, we have

$$\begin{aligned} z_t(\mathbf {x}_{t+1})&= Y_t(\mathbf {x}_t)+\xi _t\langle \nabla Y_t(\mathbf {x}_t),\mathbf {s}_t-\mathbf {x}_t\rangle \nonumber \\&+\Vert \xi _t(\mathbf {s}_t-\mathbf {x}_t)\Vert ^2 -Y_t(\mathbf {x}_t^*). \end{aligned}$$
(32)

Moreover, from the definition of \(\mathbf {s}_t\), we have \(\mathbf {s}_t,\mathbf {x}_t\in \mathcal {F}\). Therefore, from Assumption 3 and the definition of \(\xi _t\), we obtain

$$\begin{aligned} z_t(\mathbf {x}_{t+1})&\le Y_t(\mathbf {x}_t)-Y_t(\mathbf {x}_t^*)\nonumber \\&+\frac{\varpi _{\top }}{\sqrt{t}}\left\langle \nabla Y_t(\mathbf {x}_t),\mathbf {s}_t-\mathbf {x}_t\right\rangle +\frac{\varpi _{\top }^2}{t} J_{\top }^2. \end{aligned}$$
(33)

According to the definition that \(\mathbf {s}_t:=\arg \min _{\mathbf {x}\in \mathcal {F}}\left\langle \nabla Y_t(\mathbf {x}_t),\mathbf {x} \right\rangle\), we have

$$\begin{aligned} \langle \nabla Y_t(\mathbf {x}_t),\mathbf {s}_t\rangle \le \langle \nabla Y_t(\mathbf {x}_t),\mathbf {x}_t^*\rangle . \end{aligned}$$
(34)

Based on Equation (34) and the convexity of \(Y_t(\mathbf {x})\), we obtain

$$\begin{aligned} \langle \nabla Y_t(\mathbf {x}_t),\mathbf {s}_t-\mathbf {x}_t\rangle&\le \langle \nabla Y_t(\mathbf {x}_t),\mathbf {x}_t^*-\mathbf {x}_t\rangle \nonumber \\&\le Y_t(\mathbf {x}_t^*)-Y_t(\mathbf {x}_t). \end{aligned}$$
(35)

Furthermore, inserting Equation (35) into Equation (33), we obtain

$$\begin{aligned} z_t(\mathbf {x}_{t+1})&\le Y_t(\mathbf {x}_t)-Y_t(\mathbf {x}_t^*)+\frac{\varpi _{\top }}{\sqrt{t}} (Y_t(\mathbf {x}_t^*)-Y_t(\mathbf {x}_t)) +\frac{\varpi _{\top }^2}{t} J_{\top }^2 \nonumber \\&\le \Big (1-\frac{\varpi _{\top }}{\sqrt{t}}\Big )\Big [Y_t(\mathbf {x}_t)-Y_t(\mathbf {x}_t^*)\Big ] +\frac{\varpi _{\top }^2}{t} J_{\top }^2 \nonumber \\&\le \Big (1-\frac{\varpi _{\top }}{\sqrt{t}}\Big ) z_t +\frac{\varpi _{\top }^2}{t} J_{\top }^2. \end{aligned}$$
(36)

Since

$$\mathbf {x}_t^*:=\arg \min _{\mathbf {x}\in \mathbb {R}^n}Y_t(\mathbf {x}),$$

we have \(Y_t(\mathbf {x}_{t}^*)\le Y_t(\mathbf {x}_{t+1}^*).\) Furthermore, from the definition of \(z_{t+1},\) we obtain

$$\begin{aligned} z_{t+1}&= Y_{t+1}(\mathbf {x}_{t+1}) - Y_{t+1}(\mathbf {x}_{t+1}^*) \nonumber \\&= Y_{t+1}(\mathbf {x}_{t+1}) - Y_{t}(\mathbf {x}_{t+1})+ Y_{t}(\mathbf {x}_{t+1}) - Y_{t}(\mathbf {x}_{t+1}^*) \nonumber \\&+ Y_{t}(\mathbf {x}_{t+1}^*) - Y_{t+1}(\mathbf {x}_{t+1}^*) \nonumber \\&\le Y_{t+1}(\mathbf {x}_{t+1}) - Y_{t}(\mathbf {x}_{t+1}) + Y_{t}(\mathbf {x}_{t+1}) - Y_{t}(\mathbf {x}_{t}^*) \nonumber \\&+ Y_{t}(\mathbf {x}_{t+1}^*) - Y_{t+1}(\mathbf {x}_{t+1}^*). \end{aligned}$$
(37)

Moreover, by the definition of \(Y_{t+1}(\mathbf {x})\), we obtain

$$\begin{aligned} Y_{t+1}(\mathbf {x}) - Y_{t}(\mathbf {x})&= \delta \left\langle \sum _{\tau =1}^{t} \mathbf {m}_{\tau }, \mathbf {x}\right\rangle + \Vert \mathbf {x}-\mathbf {x}_1\Vert ^2 \nonumber \\&- \delta \left\langle \sum _{\tau =1}^{t-1} \mathbf {m}_{\tau }, \mathbf {x}\right\rangle - \Vert \mathbf {x}-\mathbf {x}_1\Vert ^2 \nonumber \\&=\delta \langle \mathbf {m}_{t}, \mathbf {x} \rangle . \end{aligned}$$
(38)

In addition, combining Equation (37) with Equation (38), we have

$$\begin{aligned} z_{t+1}&\le \delta \langle \mathbf {m}_{t}, \mathbf {x}_{t+1} \rangle + z_t(\mathbf {x}_{t+1}) - \delta \langle \mathbf {m}_{t}, \mathbf {x}_{t+1}^* \rangle \nonumber \\&= z_t(\mathbf {x}_{t+1}) + \delta \langle \mathbf {m}_{t}, \mathbf {x}_{t+1}-\mathbf {x}_{t+1}^* \rangle . \end{aligned}$$
(39)

Furthermore, applying the Cauchy-Schwarz inequality to Equation (39), we obtain

$$\begin{aligned} z_{t+1} \le z_t(\mathbf {x}_{t+1}) + \delta \Vert \mathbf {m}_{t}\Vert \Vert \mathbf {x}_{t+1}-\mathbf {x}_{t+1}^*\Vert . \end{aligned}$$
(40)

Next, according to Assumption 2 and by unrolling the recursion in Equation (12), we obtain

$$\begin{aligned} \Vert \mathbf {m}_t\Vert&= \left\| \beta _1^t\mathbf {m}_0 + (1-\beta _1)\left( \mathbf {g}_t + \beta _1\mathbf {g}_{t-1}+\ldots +\beta _1^{t-1}\mathbf {g}_1\right) \right\| \nonumber \\&\le (1-\beta _1)\left( 1+\beta _1+\ldots +\beta _1^{t-1}\right) K_{\top } \nonumber \\&\le K_{\top }. \end{aligned}$$
(41)
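As a quick numerical sanity check of Equation (41) (not part of the original proof), the following Python snippet iterates the first-order momentum recursion with bounded gradients, confirms that its magnitude never exceeds \(K_{\top }\), and compares it against the unrolled closed form; the constants beta1, K_top and T are illustrative.

```python
import numpy as np

np.random.seed(0)
beta1, K_top, T = 0.9, 1.0, 200
g = np.random.uniform(-K_top, K_top, size=T)       # bounded gradients, |g_t| <= K_top

m = 0.0
for t in range(T):
    m = beta1 * m + (1.0 - beta1) * g[t]           # recursion of Eq. (12) with m_0 = 0
    assert abs(m) <= K_top + 1e-12                 # the bound of Eq. (41)

# Unrolled closed form, as in the first line of Eq. (41): both expressions agree.
m_unrolled = (1.0 - beta1) * sum(beta1 ** k * g[T - 1 - k] for k in range(T))
print(abs(m - m_unrolled) < 1e-10)                 # True
```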

By definition, \(Y_t(\mathbf {x})\) is 2-strongly convex. Meanwhile, from the optimality of \(\mathbf {x}_t^*\) and Definition 2, for any \(\mathbf {x}\in \mathcal {F}\), we obtain

$$\begin{aligned} \Vert \mathbf {x}-\mathbf {x}_t^*\Vert ^2\le Y_t(\mathbf {x}) - Y_t(\mathbf {x}_t^*). \end{aligned}$$
(42)

Setting \(\mathbf {x}=\mathbf {x}_{t+1}\) and considering time \(t+1\), we obtain

$$\begin{aligned} \Vert \mathbf {x}_{t+1}-\mathbf {x}_{t+1}^*\Vert ^2&\le Y_t(\mathbf {x}_{t+1}) - Y_t(\mathbf {x}_{t+1}^*)\nonumber \\&=z_{t+1}. \end{aligned}$$
(43)

Therefore, combining Equations (36), (40), (41) and (43), we have

$$\begin{aligned} z_{t+1} \le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) z_t + \delta K_{\top }\sqrt{z_{t+1}}+\frac{\varpi _{\top }^2 J_{\top }^2}{t}. \end{aligned}$$
(44)

Consequently, Lemma 1 follows from the above analysis.

Proof of Lemma 2

Proof

To compare the terms \(\frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}}\right)\) and \(\frac{1}{\sqrt{t+1}}\), we directly calculate the difference of their squares as follows:

$$\begin{aligned}&\left[ \frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}}\right) \right] ^2 - \left( \frac{1}{\sqrt{t+1}}\right) ^2 \nonumber \\&= \frac{1}{t}\left( 1-\frac{1}{\sqrt{t}}+\frac{1}{4t}\right) - \frac{1}{t+1} \nonumber \\&= \frac{1}{t}-\frac{1}{t\sqrt{t}}+\frac{1}{4t^2}-\frac{1}{t+1} \nonumber \\&= \frac{t(t+1)}{t(t+1)}\left( \frac{1}{t}-\frac{1}{t\sqrt{t}}+\frac{1}{4t^2}-\frac{1}{t+1}\right) \nonumber \\&=\frac{1}{t(t+1)}\left( t+1-\frac{t+1}{\sqrt{t}}+\frac{t+1}{4t}-t\right) \nonumber \\&=\frac{1}{4t^2(t+1)}\left( 5t+1-4\sqrt{t}(t+1)\right) . \end{aligned}$$
(45)

From the fact that \(5t+1\le 4\sqrt{t}(t+1)\) for all \(t\ge 1\), we have that

$$\begin{aligned} \left[ \frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}}\right) \right] ^2 - \left( \frac{1}{\sqrt{t+1}}\right) ^2\le 0. \end{aligned}$$
(46)

Therefore, by Equation (46), we attain the result of Lemma 2:

$$\begin{aligned} \frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}} \right) \le \frac{1}{\sqrt{t+1}}. \end{aligned}$$
(47)

The proof of Lemma 2 is completed.
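The inequality of Lemma 2 can also be checked numerically; the short snippet below (an illustrative check, not part of the proof) verifies Equation (47) over a range of integer t.

```python
import math

# Numerical check of Lemma 2: (1/sqrt(t)) * (1 - 1/(2*sqrt(t))) <= 1/sqrt(t + 1).
for t in range(1, 10001):
    lhs = (1.0 / math.sqrt(t)) * (1.0 - 1.0 / (2.0 * math.sqrt(t)))
    rhs = 1.0 / math.sqrt(t + 1)
    assert lhs <= rhs, f"inequality fails at t = {t}"
print("Lemma 2 holds for t = 1, ..., 10000")
```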

Proof of Lemma 3

Proof

Since the parameters chosen by our proposed algorithm satisfy

$$t\delta K_{\top }\sqrt{z_{t+1}}\le 3\varpi _{\top }^2 J_{\top }^2-2\varpi _{\top }J_{\top }^2,$$

from Equation (24) we obtain

$$\begin{aligned} z_{t+1} \le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) z_t +\frac{4\varpi _{\top }^2 J_{\top }^2}{t}-\frac{2\varpi _{\top } J_{\top }^2}{t}. \end{aligned}$$
(48)

Next, to prove Equation (26), we use mathematical induction. First, considering the base case \(t=1\), we have

$$\begin{aligned} z_1&= Y_1(\mathbf {x}_1) - Y_1(\mathbf {x}_1^*) \nonumber \\&=\delta \langle \mathbf {m}_0,\mathbf {x}_1\rangle + \Vert \mathbf {x}_1-\mathbf {x}_1\Vert ^2 - \delta \langle \mathbf {m}_0,\mathbf {x}_1^*\rangle - \Vert \mathbf {x}_1-\mathbf {x}_1^*\Vert ^2 \nonumber \\&= - \Vert \mathbf {x}_1-\mathbf {x}_1^*\Vert ^2 \le 4\varpi _{\top }J_{\top }^2 . \end{aligned}$$
(49)

Thus, the base case \(t=1\) holds. Second, assume that Equation (26) is true at time t. We then consider time \(t+1\) as follows. From Equation (24) and the relationship

$$t\delta K_{\top }\sqrt{z_{t+1}}\le 3\varpi _{\top }^2 J_{\top }^2-2\varpi _{\top }J_{\top }^2,$$

we obtain

$$\begin{aligned} z_{t+1}&\le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) z_t + \delta K_{\top }\sqrt{z_{t+1}}+\frac{\varpi _{\top }^2 J_{\top }^2}{t} \nonumber \\&\le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) z_t+\frac{3\varpi _{\top }^2 J_{\top }^2-2\varpi _{\top }J_{\top }^2}{t}+\frac{\varpi _{\top }^2 J_{\top }^2}{t} \nonumber \\&\le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) \frac{4\varpi _{\top }J_{\top }^2}{\sqrt{t}}+\frac{4\varpi _{\top }^2 J_{\top }^2}{t}-\frac{2\varpi _{\top } J_{\top }^2}{t} \nonumber \\&\le \frac{4\varpi _{\top }J_{\top }^2}{\sqrt{t}} - \frac{4\varpi _{\top }^2 J_{\top }^2}{t} +\frac{4\varpi _{\top }^2 J_{\top }^2}{t}-\frac{2\varpi _{\top } J_{\top }^2}{t} \nonumber \\&\le 4\varpi _{\top }J_{\top }^2\left( \frac{1}{\sqrt{t}} -\frac{1}{2t}\right) \nonumber \\&\le 4\varpi _{\top }J_{\top }^2\left[ \frac{1}{\sqrt{t}}\left( 1 -\frac{1}{2\sqrt{t}}\right) \right] . \end{aligned}$$
(50)

In addition, applying Lemma 2 to Equation (50), we attain

$$\begin{aligned} z_{t+1}&\le 4\varpi _{\top }J_{\top }^2\left[ \frac{1}{\sqrt{t}}\left( 1 -\frac{1}{2\sqrt{t}}\right) \right] \le \frac{4\varpi _{\top }J_{\top }^2}{\sqrt{t+1}}. \end{aligned}$$
(51)

By Equation (51), Equation (26) is true at time \(t+1\); therefore, by induction it holds for all \(t\in \{1,\ldots ,T\}\). The proof of Lemma 3 is completed.
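For intuition, the following snippet (with arbitrarily chosen \(\varpi _{\top }\) and \(J_{\top }\), labelled varpi and J) iterates the recursion of Equation (48) with equality from the worst-case base value allowed by Equation (49) and confirms that the resulting sequence never exceeds the bound \(4\varpi _{\top }J_{\top }^2/\sqrt{t}\) of Lemma 3.

```python
import math

# Iterate the recursion of Eq. (48) with equality, starting from the worst-case
# base value z_1 = 4 * varpi * J^2 permitted by Eq. (49), and check the bound
# of Lemma 3 at every step. The constants varpi and J are illustrative.
varpi, J = 0.8, 1.0
z = 4.0 * varpi * J ** 2
for t in range(1, 10001):
    assert z <= 4.0 * varpi * J ** 2 / math.sqrt(t) + 1e-12
    z = (1.0 - varpi / math.sqrt(t)) * z + (4.0 * varpi ** 2 - 2.0 * varpi) * J ** 2 / t
print("the bound of Lemma 3 holds along the recursion")
```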

Proof of Theorem 1

Proof

Denote \(\mathbf {x}^*:=\arg \min _{\mathbf {x}\in \mathcal {F}}\sum _{t=1}^T f_t(\mathbf {x})\). According to the definition of the regret \(\mathcal {R}(T)\), we have

$$\begin{aligned} \mathcal {R}(T)&= \sum _{t=1}^T f_t(\mathbf {x}_t) - \min _{\mathbf {x}\in \mathcal {F}}\sum _{t=1}^T f_t(\mathbf {x}) \nonumber \\&= \sum _{t=1}^T\big [ f_t(\mathbf {x}_t) - f_t(\mathbf {x}^*) \big ] \nonumber \\&= \sum _{t=1}^T\big [ f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*) + f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*) \big ] \nonumber \\&\le \sum _{t=1}^T \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert + \sum _{t=1}^T \big [f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*) \big ]. \end{aligned}$$
(52)

To get the bound of \(\mathcal {R}(T)\), we first consider the term

$$\sum _{t=1}^T \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert$$

in Equation (52). By Assumption 1, the function \(f_t(\mathbf {x})\) is Lipschitz with constant L for all \(t\in \{1,\ldots ,T\}\). Moreover, by Definition 3, we obtain

$$\begin{aligned} \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert \le L\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert . \end{aligned}$$
(53)

Summing Equation (53) over t, we have

$$\begin{aligned} \sum _{t=1}^T\big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert \le L\sum _{t=1}^T\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert . \end{aligned}$$
(54)

Setting \(\mathbf {x}=\mathbf {x}_t\) in Equation (42) and applying Lemma 3, we attain

$$\begin{aligned} \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert&\le \sqrt{Y_t(\mathbf {x}_t)-Y_t(\mathbf {x}_t^*)} \nonumber \\&\le \sqrt{z_t} \nonumber \\&\le \frac{2J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}}. \end{aligned}$$
(55)

Comparing the sum with the corresponding integral, we have the relationship

$$\sum _{t=1}^T \frac{1}{t^{1/4}}\le \int _{0}^T \frac{1}{t^{1/4}}dt=\frac{4}{3}T^{3/4}.$$

Therefore, combining Equations (54) and (55), we obtain

$$\begin{aligned} \sum _{t=1}^T\big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert&\le L\sum _{t=1}^T\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert \nonumber \\&\le L\sum _{t=1}^T\frac{2J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}} \nonumber \\&\le \frac{8}{3}LJ_{\top }\sqrt{\varpi _{\top }}T^{3/4}. \end{aligned}$$
(56)

Now, the bound of \(\sum _{t=1}^T \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert\) is obtained. Next, we turn to calculate the bound of the term

$$\sum _{t=1}^T \big [f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*) \big ]$$

in Equation (52). By the smoothness of \(f_t(\mathbf {x})\) and Definition 4, we have

$$\begin{aligned} f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)&=f_t\left[ \mathbf {x}_t-(\mathbf {x}_t-\mathbf {x}_t^*)\right] - f_t(\mathbf {x}^*) \nonumber \\&\le f_t(\mathbf {x}_t) - f_t(\mathbf {x}^*) - \mathbf {g}_t\odot (\mathbf {x}_t-\mathbf {x}_t^*) \nonumber \\&+\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 . \end{aligned}$$
(57)

Moreover, from the convexity of \(f_t(\mathbf {x})\), Definition 1 and the optimality of \(\mathbf {x}^*\), we further obtain

$$\begin{aligned} f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)&\le \nabla f_{t}(\mathbf {x}^*)\odot (\mathbf {x}^*-\mathbf {x}_t) +\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 - \mathbf {g}_t\odot (\mathbf {x}_t-\mathbf {x}_t^*)\nonumber \\&\le \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 - \mathbf {g}_t\odot (\mathbf {x}_t-\mathbf {x}_t^*). \end{aligned}$$
(58)

In addition, applying the Cauchy-Schwarz inequality to Equation (58), we attain

$$\begin{aligned} f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)&\le \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 - \mathbf {g}_t\odot (\mathbf {x}_t-\mathbf {x}_t^*) \nonumber \\&\le \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 + \Vert \mathbf {g}_t\Vert \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert . \end{aligned}$$
(59)

Applying Assumptions 2 and 3, and using Equation (55), we further have

$$\begin{aligned} f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)&\le \left( \frac{2J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}}\right) ^2+ \frac{2K_{\top }J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}} \nonumber \\&\le \frac{4J_{\top }^2\varpi _{\top }}{t^{1/2}}+\frac{2K_{\top }J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}}. \end{aligned}$$
(60)

Next, summing both sides of Equation (60) over t, we obtain

$$\begin{aligned} \sum _{t=1}^{T} \big [f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)\big ]&\le \sum _{t=1}^{T}\frac{4J_{\top }^2\varpi _{\top }}{t^{1/2}}+\sum _{t=1}^{T}\frac{2K_{\top }J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}}. \end{aligned}$$
(61)

Substituting the inequalities

$$\sum _{t=1}^T \frac{1}{t^{1/2}}\le 2\sqrt{T}$$

and

$$\sum _{t=1}^T \frac{1}{t^{1/4}}\le \frac{4}{3}T^{3/4}$$

into Equation (61), we attain

$$\begin{aligned} \sum _{t=1}^{T} \big [f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)\big ] \le 8J_{\top }^2\varpi _{\top }T^{1/2} + \frac{8}{3}K_{\top }J_{\top }\sqrt{\varpi _{\top }}T^{3/4}. \end{aligned}$$
(62)

Finally, combining Equations (52), (56) and (62), we have that

$$\begin{aligned} \mathcal {R}(T)&\le \frac{8}{3}LJ_{\top }\sqrt{\varpi _{\top }}T^{3/4}+ \frac{8}{3}K_{\top }J_{\top }\sqrt{\varpi _{\top }}T^{3/4} + 8J_{\top }^2\varpi _{\top }T^{1/2}\nonumber \\&= \frac{8}{3}(L+K_{\top })J_{\top }\sqrt{\varpi _{\top }}T^{3/4} + 8J_{\top }^2\varpi _{\top }T^{1/2}. \end{aligned}$$
(63)

Therefore, the stated bound of the regret \(\mathcal {R}(T)\) is obtained.
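The two elementary summation bounds used above, \(\sum _{t=1}^T t^{-1/2}\le 2\sqrt{T}\) and \(\sum _{t=1}^T t^{-1/4}\le \frac{4}{3}T^{3/4}\), can be checked numerically with the illustrative snippet below.

```python
# Check the two elementary summation bounds used in the proof of Theorem 1.
for T in (10, 100, 1000, 10000):
    s_half = sum(t ** -0.5 for t in range(1, T + 1))
    s_quarter = sum(t ** -0.25 for t in range(1, T + 1))
    assert s_half <= 2.0 * T ** 0.5                  # sum t^(-1/2) <= 2 sqrt(T)
    assert s_quarter <= (4.0 / 3.0) * T ** 0.75      # sum t^(-1/4) <= (4/3) T^(3/4)
print("both summation bounds hold")
```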

Proof of Theorem 2

Proof

Following [18], we define the loss function as in Equation (30); its minimum regret is attained at \(x=0\). For Adam, we set \(\beta _1=0\), \(0<\sqrt{\beta _2}<\lambda <1\), and \(\alpha _t = \alpha / \sqrt{t}\), where \(t\in \{1,\ldots ,T\}\). Then the gradient of \(f_t(x_{t,i})\) when \(x_{t,i}\ge 0\) is as follows

$$\begin{aligned} g_{t,i} = C\lambda ^{t-1}. \end{aligned}$$
(64)

Moreover, unrolling the recursion in Equation (27), we obtain the following

$$\begin{aligned} v_{t,i}&= \beta _2 v_{t-1,i} + (1-\beta _2)\left( C\lambda ^{t-1}\right) ^2 \nonumber \\&= \sum _{\tau =1}^{t}\beta _{2}^{t-\tau }(1-\beta _2)\left( C\lambda ^{\tau -1}\right) ^2 \nonumber \\&= \frac{(1-\beta _2)\beta _2^t C^2}{\lambda ^2}\sum _{\tau =1}^{t}\left( \frac{\lambda ^2}{\beta _2}\right) ^{\tau } \nonumber \\&= \frac{(1-\beta _2)\left( \lambda ^{2t}-\beta _2^t\right) C^2}{\lambda ^2 - \beta _2}. \end{aligned}$$
(65)

Since \(\sqrt{\beta _2}<\lambda\), we have the following

$$\begin{aligned} \alpha _t\frac{m_{t,i}}{\sqrt{v_{t,i}}}&= \frac{\alpha (1-\beta _1)\left( \lambda ^t-\beta _1^t\right) }{\sqrt{t}(\lambda -\beta _1)}\frac{\sqrt{\lambda ^2-\beta _2}}{\sqrt{(1-\beta _2)\left( \lambda ^{2t}-\beta _2^t\right) }} \nonumber \\&= \frac{\alpha (1-\beta _1)\sqrt{\lambda ^2-\beta _2}}{\sqrt{t}(\lambda -\beta _1)\sqrt{(1-\beta _2)}}\frac{1-\left( \beta _1/\lambda \right) ^t}{\sqrt{1-\left( \beta _2/\lambda ^2\right) ^t}} \nonumber \\&\ge \frac{\alpha (1-\beta _1)\sqrt{\lambda ^2-\beta _2}}{\sqrt{t}(\lambda -\beta _1)\sqrt{(1-\beta _2)}}\left( 1-\frac{\beta _1}{\lambda }\right) . \end{aligned}$$
(66)

By Equations (28) and (66), we attain the following

$$\begin{aligned} x_{t+1,i}&= x_{t,i}-\alpha _t\frac{m_{t,i}}{\sqrt{v_{t,i}}} \nonumber \\&\le x_{t,i} - \frac{\alpha (1-\beta _1)\sqrt{\lambda ^2-\beta _2}}{\sqrt{t}(\lambda -\beta _1)\sqrt{(1-\beta _2)}}\left( 1-\frac{\beta _1}{\lambda }\right) \nonumber \\&\le x_{0,i} - \frac{\alpha (1-\beta _1)\sqrt{\lambda ^2-\beta _2}}{(\lambda -\beta _1)\sqrt{(1-\beta _2)}}\left( 1-\frac{\beta _1}{\lambda }\right) \sum _{\tau =1}^t\frac{1}{\sqrt{\tau }}. \end{aligned}$$
(67)

Since \(\sum _{\tau =1}^t\frac{1}{\sqrt{\tau }}\) diverges as \(t\rightarrow \infty\), Equation (67) implies that Adam always reaches the negative region when t is large enough.
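The behaviour derived above can be reproduced numerically. The snippet below (with arbitrarily chosen \(C\), \(\lambda\), \(\beta _2\) and \(\alpha\) satisfying \(\sqrt{\beta _2}<\lambda <1\)) confirms that the closed form of Equation (65) matches the recursion and that Adam's cumulative step size grows without bound, which is what drives the iterate into the negative region.

```python
import math

# Illustrative constants with sqrt(beta2) < lambda < 1, as assumed above.
C, lam, beta2, alpha = 1.0, 0.9, 0.7, 0.1
v, cum_step = 0.0, 0.0
for t in range(1, 2001):
    g = C * lam ** (t - 1)                                  # gradient of Eq. (64)
    v = beta2 * v + (1.0 - beta2) * g ** 2                  # recursion used in Eq. (65)
    v_closed = (1.0 - beta2) * (lam ** (2 * t) - beta2 ** t) * C ** 2 / (lam ** 2 - beta2)
    assert abs(v - v_closed) < 1e-12                        # closed form of Eq. (65)
    cum_step += (alpha / math.sqrt(t)) * g / math.sqrt(v)   # Adam's effective step size
print(cum_step)   # grows like sqrt(t), matching the divergence argument after Eq. (67)
```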

For LightAdam, we also set \(\beta _1=0, 0<\sqrt{\beta _2}<\lambda <1,\) and \(\alpha _t = \alpha / \sqrt{t},\) where \(t\in \{1,\ldots ,T\}\). Then, by Equations (13) and (14), we obtain the following

$$\begin{aligned} \hat{v}_{t,i}&= \frac{\sum _{\tau =1}^t\beta _2(1+\beta _2)^{t-\tau }\left( C\lambda ^{\tau -1}\right) ^2}{(1+\beta _2)^t-1} \nonumber \\&= \frac{(1+\beta _2)^t\beta _2\lambda ^{-2}C^2}{(1+\beta _2)^t-1}\sum _{\tau =1}^t\left( \frac{\lambda ^2}{1+\beta _2}\right) ^{\tau } \nonumber \\&= \frac{\beta _2 C^2}{1+\beta _2-\lambda ^2}\frac{(1+\beta _2)^t-\lambda ^{2t}}{(1+\beta _2)^t-1}. \end{aligned}$$
(68)

By Equation (68), we further have the following

$$\begin{aligned} \alpha _t\frac{m_{t,i}}{\sqrt{\hat{v}_{t,i}}}&= \frac{\alpha \lambda ^t C}{\sqrt{t}\lambda }\frac{\sqrt{1+\beta _2-\lambda ^2}\sqrt{(1+\beta _2)^t-1}}{\sqrt{\beta _2 C^2}\sqrt{(1+\beta _2)^t-\lambda ^{2t}}} \nonumber \\&=\frac{\alpha \lambda ^{t-1}\sqrt{1+\beta _2-\lambda ^2}}{\sqrt{\beta _2 t}}\frac{\sqrt{(1+\beta _2)^t-1}}{\sqrt{(1+\beta _2)^t-\lambda ^{2t}}} \nonumber \\&\le \alpha \sqrt{\frac{1+\beta _2-\lambda ^2}{\beta _2 t}}\lambda ^{t-1}. \end{aligned}$$
(69)

Moreover, from Equations (15), (18) and (69), we attain the following

$$\begin{aligned} x_{t+1,i}&= x_{t,i}-\alpha _t\frac{m_{t,i}}{\sqrt{\hat{v}_{t,i}}} \nonumber \\&\ge x_{t,i} - \alpha \sqrt{\frac{1+\beta _2-\lambda ^2}{\beta _2 t}}\lambda ^{t-1} \nonumber \\&\ge x_{0,i} - \alpha \sqrt{\frac{1+\beta _2-\lambda ^2}{\beta _2}}\sum _{\tau =1}^t\frac{\lambda ^{\tau -1}}{\sqrt{\tau }}. \end{aligned}$$
(70)

From Equation (69), the step size of LightAdam is bounded, and therefore it is not disturbed by extreme gradients. In addition, from Equation (70), the total displacement of LightAdam remains bounded, so it is able to converge to the optimal solution if its step size and parameters are initialized suitably. Therefore, the proof of Theorem 2 is completed.
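Finally, the contrast between Equations (67) and (70) can be illustrated numerically: with the same hypothetical constants as before, the per-step bound of Equation (69) decays geometrically, so LightAdam's total displacement stays bounded, whereas the corresponding lower bound for Adam keeps growing with T.

```python
import math

# Same hypothetical constants as before, with sqrt(beta2) < lambda < 1.
lam, beta2, alpha = 0.9, 0.7, 0.1
light_total, adam_total = 0.0, 0.0
for t in range(1, 10001):
    # Per-step bound of Eq. (69): geometric decay keeps the total displacement bounded.
    light_total += alpha * math.sqrt((1.0 + beta2 - lam ** 2) / (beta2 * t)) * lam ** (t - 1)
    # Per-step lower bound of Eq. (67) (up to a constant factor): diverges with T.
    adam_total += alpha / math.sqrt(t)
print(f"LightAdam total displacement bound: {light_total:.4f} (bounded)")
print(f"Adam total displacement lower bound: {adam_total:.1f} (grows like sqrt(T))")
```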


About this article


Cite this article

Zhou, Y., Huang, K., Cheng, C. et al. LightAdam: Towards a Fast and Accurate Adaptive Momentum Online Algorithm. Cogn Comput 14, 764–779 (2022). https://doi.org/10.1007/s12559-021-09985-9
