
LightAdam: Towards a Fast and Accurate Adaptive Momentum Online Algorithm


Abstract

Adaptive optimization algorithms enjoy fast convergence and have been widely exploited in pattern recognition and cognitively inspired machine learning. However, these algorithms may suffer from high computational cost and low generalization ability due to their projection steps. Such limitations make them difficult to apply to big data analytics, which is typical of cognitively inspired learning, e.g., deep learning tasks. In this paper, we propose a fast and accurate adaptive momentum online algorithm, called LightAdam, to alleviate the drawbacks of the projection steps in adaptive algorithms. The proposed algorithm substantially reduces the computational cost of each iteration by replacing high-order projection operators with one-dimensional linear searches. Moreover, we introduce a novel second-order momentum and employ dynamic learning rate bounds in the proposed algorithm, thereby obtaining higher generalization ability than other adaptive algorithms. We theoretically show that the proposed algorithm has a guaranteed convergence bound and prove that it has better generalization capability than Adam. We conduct extensive experiments on three public datasets for image pattern classification, and validate the computational benefit and accuracy of the proposed algorithm in comparison with other state-of-the-art adaptive optimization algorithms.
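As a rough illustration of the update pattern described above, the following Python sketch combines Adam-style first- and second-order momentum with a projection-free, Frank-Wolfe-style linear step over a feasible set. The function names, the L2-ball feasible set, and all constants are illustrative assumptions and not the paper's exact algorithm or its dynamic learning-rate bounds.

```python
import numpy as np

def linear_oracle_l2_ball(direction, radius=1.0):
    """Linear minimization oracle over an L2 ball (assumed feasible set):
    argmin_{||s|| <= radius} <direction, s>."""
    norm = np.linalg.norm(direction)
    if norm == 0.0:
        return np.zeros_like(direction)
    return -radius * direction / norm

def adaptive_frank_wolfe_step(x, grad, m, v, t,
                              beta1=0.9, beta2=0.999, eps=1e-8, c=1.0):
    """One illustrative projection-free step with adaptive momentum.
    Hypothetical sketch: the exact moment estimates and dynamic learning-rate
    bounds of LightAdam are not reproduced here."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-order momentum
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-order momentum
    direction = m / (np.sqrt(v) + eps)            # adaptive search direction
    s = linear_oracle_l2_ball(direction)          # 1-D linear search replaces projection
    xi = min(1.0, c / np.sqrt(t))                 # decaying convex-combination weight
    return (1.0 - xi) * x + xi * s, m, v          # convex combination stays feasible
```

The convex combination with weight \(\xi_t\) keeps the iterate feasible without ever computing a projection, which is the source of the per-iteration savings highlighted above.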


References

  1. McMahan HB, Streeter MJ. Adaptive bound optimization for online convex optimization, in: The 23rd Conference on Learning Theory. 2010:244–256.

  2. Sutskever I, Martens J, Dahl GE, Hinton GE. On the importance of initialization and momentum in deep learning, in: Proceedings of the 30th International Conference on Machine Learning. 2013:1139–1147.

  3. Long M, Cao Y, Cao Z, Wang J, Jordan M. Transferable representation learning with deep adaptation networks. IEEE Trans Pattern Anal Mach Intell. 2019;41:3071–85.


  4. Yang X, Huang K, Zhang R, et al. A Novel Deep Density Model for Unsupervised Learning. Cogn Comput. 2019;11:778–88.


  5. Nguyen B, Morell C, Baets BD. Scalable large-margin distance metric learning using stochastic gradient descent. IEEE Transactions on Cybernetics. 2020;50:1072–83.


  6. Balcan M, Khodak M, Talwalkar A. Provable guarantees for gradient-based meta-learning, in: Proceedings of the 36th International Conference on Machine Learning, 2019:424–433.

  7. Nesterov Y. A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Doklady AN USSR. 1983;269:543–7.


  8. Tieleman T, Hinton G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, in: COURSERA: Neural Networks for Machine Learning. 2012.

  9. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12:2121–59.


  10. Ghadimi E, Feyzmahdavian HR, Johansson M. Global convergence of the heavy-ball method for convex optimization, in: Proceedings of The European Control Conference. 2015:310–315.

  11. Yang X, Zheng X, Gao H. SGD-Based Adaptive NN Control Design for Uncertain Nonlinear Systems. IEEE Transactions on Neural Networks and Learning Systems. 2018;29(10):5071–83.


  12. Peng Y, Hao Z, Yun X. Lock-free parallelization for variance-reduced stochastic gradient descent on streaming data. IEEE Trans Parallel Distrib Syst. 2020;31:2220–31.


  13. Perantonis SJ, Karras DA. An efficient constrained learning algorithm with momentum acceleration. Neural Netw. 1995;8:237–49.


  14. Kingma DP, Ba JL. Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations. 2015:1–15.

  15. Gu G, Dogandžić A. Projected nesterov’s proximal-gradient algorithm for sparse signal recovery. IEEE Trans Signal Process. 2017;65:3510–25.


  16. Chen J, Zhou D, Tang Y, et al. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks, in: Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence. 2020.

  17. Reddi SJ, Kale S, Kumar S. On the convergence of Adam and beyond, in: Proceedings of the Sixth International Conference on Learning Representations. 2018:1–23.

  18. Li W, Zhang Z, Wang X, Luo P. Adax: Adaptive gradient descent with exponential long term memory. 2020. https://arxiv.org/abs/2004.09740

  19. Luo L, Xiong Y, Liu Y, Sun X. Adaptive gradient methods with dynamic bound of learning rate, in: Proceedings of the Seventh International Conference on Learning Representations. 2019:1–19.

  20. Zhou Z, Zhang Q, Lu G, Wang H, Zhang W, Yu Y. Adashift: Decorrelation and convergence of adaptive learning rate methods. 2019:1–26.

  21. Hazan E, Kale S. Projection-free online learning, in: Proceedings of the 29th International Conference on Machine Learning, 2012:1–8.

  22. Balles L, Hennig P. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients, in: Proceedings of the 35th International Conference on Machine Learning, PMLR 80:404-413. 2018.

  23. Chen L, Harshaw C, Hassani H, Karbasi A. Projection-free online optimization with stochastic gradient: From convexity to submodularity, in: Proceedings of the 35th International Conference on Machine Learning. 2018:813–822.

  24. Hazan E, Minasyan E. Faster projection-free online learning, in: Proceedings of the 33rd Annual Conference on Learning Theory. 2020:1877–1893.

  25. Zhang M, Zhou Y, Quan W, Zhu J, Zheng R, Wu Q. Online learning for iot optimization: A frank-wolfe adam based algorithm. IEEE Internet Things J. 2020;7:8228–37.


  26. Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent, in: Proceedings of the Twentieth International Conference on Machine Learning. 2003:928–936.

  27. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770–778.

  28. Huang G, Liu Z, Maaten L, Weinberger KQ. Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:1–9.

  29. Berrada L, Zisserman A, Kumar MP. Deep Frank-Wolfe For Neural Network Optimization, in: Proceedings of the International Conference on Learning Representations. 2019.

  30. Lu H, Jin L, Luo X, et al. RNN for Solving Perturbed Time-Varying Underdetermined Linear System With Double Bound Limits on Residual Errors and State Variables. IEEE Trans Industr Inf. 2019;15(11):5931–42.


  31. Luo X, Zhou MC, Shang M, Xia Y. A Novel Approach to Extracting Non-Negative Latent Factors From Non-Negative Big Sparse Matrices. IEEE Access. 2016;4:2649–55.


  32. Luo X, Zhou MC, Li S, et al. Algorithms of Unconstrained Non-negative Latent Factor Analysis for Recommender Systems. IEEE Transactions on Big Data. 2021;7(1):227–40.



Funding

This work was partially supported by the Chinese Academy of Sciences under Grant No. Y9BEJ11001 and by the innovation workstation of the Suzhou Institute of Nano-Tech and Nano-Bionics (SINANO) under Grant No. E010210101. This work was also partially supported by the National Natural Science Foundation of China under Grant No. 61876155 and the Jiangsu Science and Technology Programme under Grant No. BE2020006-4.

Author information


Corresponding author

Correspondence to Xin Liu.

Ethics declarations

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix of LightAdam

Proof of Lemma 1

Proof

From the definition of \(z_{t}\) and \(\mathbf {x}_{t+1}\), we obtain

$$\begin{aligned} z_t(\mathbf {x}_{t+1})&= Y_t(\mathbf {x}_{t+1}) - Y_t(\mathbf {x}_t^*) \nonumber \\&=Y_t\big ((1-\xi _t)\mathbf {x}_t+\xi _t\mathbf {s}_t\big )-Y_t(\mathbf {x}_t^*) \nonumber \\&=Y_t\big (\mathbf {x}_t+\xi _t(\mathbf {s}_t-\mathbf {x}_t)\big )-Y_t(\mathbf {x}_t^*) \end{aligned}$$
(31)

Based on the fact that \(Y_t(\mathbf {x})\) is 2-smooth, we have

$$\begin{aligned} z_t(\mathbf {x}_{t+1})&= Y_t(\mathbf {x}_t)+\xi _t\langle \nabla Y_t(\mathbf {x}_t),\mathbf {s}_t-\mathbf {x}_t\rangle \nonumber \\&+\Vert \xi _t(\mathbf {s}_t-\mathbf {x}_t)\Vert ^2 -Y_t(\mathbf {x}_t^*). \end{aligned}$$
(32)

Moreover, from the definition of \(\mathbf {s}_t\), we have \(\mathbf {s}_t,\mathbf {x}_t\in \mathcal {F}\). Therefore, from Assumption 3 and the definition of \(\xi _t\), we obtain

$$\begin{aligned} z_t(\mathbf {x}_{t+1})&\le Y_t(\mathbf {x}_t)-Y_t(\mathbf {x}_t^*)\nonumber \\&+\frac{\varpi _{\top }}{\sqrt{t}}\left\langle \nabla Y_t(\mathbf {x}_t),\mathbf {s}_t-\mathbf {x}_t\right\rangle +\frac{\varpi _{\top }^2}{t} J_{\top }^2. \end{aligned}$$
(33)

According to the definition that \(\mathbf {s}_t:=\arg \min _{\mathbf {x}\in \mathcal {F}}\left\langle \nabla Y_t(\mathbf {x}_t),\mathbf {x} \right\rangle\), we have

$$\begin{aligned} \langle \nabla Y_t(\mathbf {x}_t),\mathbf {s}_t\rangle \le \langle \nabla Y_t(\mathbf {x}_t),\mathbf {x}_t^*\rangle . \end{aligned}$$
(34)

Based on Equation (34) and the convexity of \(Y_t(\mathbf {x})\), we obtain

$$\begin{aligned} \langle \nabla Y_t(\mathbf {x}_t),\mathbf {s}_t-\mathbf {x}_t\rangle&\le \langle \nabla Y_t(\mathbf {x}_t),\mathbf {x}_t^*-\mathbf {x}_t\rangle \nonumber \\&\le Y_t(\mathbf {x}_t^*)-Y_t(\mathbf {x}_t). \end{aligned}$$
(35)

Furthermore, inserting Equation (35) into Equation (33), we obtain

$$\begin{aligned} z_t(\mathbf {x}_{t+1})&\le Y_t(\mathbf {x}_t)-Y_t(\mathbf {x}_t^*)+\frac{\varpi _{\top }}{\sqrt{t}} (Y_t(\mathbf {x}_t^*)-Y_t(\mathbf {x}_t)) +\frac{\varpi _{\top }^2}{t} J_{\top }^2 \nonumber \\&\le \Big (1-\frac{\varpi _{\top }}{\sqrt{t}}\Big )\Big [Y_t(\mathbf {x}_t)-Y_t(\mathbf {x}_t^*)\Big ] +\frac{\varpi _{\top }^2}{t} J_{\top }^2 \nonumber \\&\le \Big (1-\frac{\varpi _{\top }}{\sqrt{t}}\Big ) z_t +\frac{\varpi _{\top }^2}{t} J_{\top }^2. \end{aligned}$$
(36)

Since

$$\mathbf {x}_t^*:=\arg \min _{\mathbf {x}\in \mathbb {R}^n}Y_t(\mathbf {x}),$$

we have \(Y_t(\mathbf {x}_{t}^*)\le Y_t(\mathbf {x}_{t+1}^*).\) Furthermore, from the definition of \(z_{t+1},\) we obtain

$$\begin{aligned} z_{t+1}&= Y_{t+1}(\mathbf {x}_{t+1}) - Y_{t+1}(\mathbf {x}_{t+1}^*) \nonumber \\&= Y_{t+1}(\mathbf {x}_{t+1}) - Y_{t}(\mathbf {x}_{t+1})+ Y_{t}(\mathbf {x}_{t+1}) - Y_{t}(\mathbf {x}_{t+1}^*) \nonumber \\&+ Y_{t}(\mathbf {x}_{t+1}^*) - Y_{t+1}(\mathbf {x}_{t+1}^*) \nonumber \\&\le Y_{t+1}(\mathbf {x}_{t+1}) - Y_{t}(\mathbf {x}_{t+1}) + Y_{t}(\mathbf {x}_{t+1}) - Y_{t}(\mathbf {x}_{t}^*) \nonumber \\&+ Y_{t}(\mathbf {x}_{t+1}^*) - Y_{t+1}(\mathbf {x}_{t+1}^*). \end{aligned}$$
(37)

Moreover, by the definition of \(Y_{t+1}(\mathbf {x})\), we obtain

$$\begin{aligned} Y_{t+1}(\mathbf {x}) - Y_{t}(\mathbf {x})&= \delta \left\langle \sum _{\tau =1}^{t} \mathbf {m}_{\tau }, \mathbf {x}\right\rangle + \Vert \mathbf {x}-\mathbf {x}_1\Vert ^2 \nonumber \\&- \delta \left\langle \sum _{\tau =1}^{t-1} \mathbf {m}_{\tau }, \mathbf {x}\right\rangle - \Vert \mathbf {x}-\mathbf {x}_1\Vert ^2 \nonumber \\&=\delta \langle \mathbf {m}_{t}, \mathbf {x} \rangle . \end{aligned}$$
(38)

In addition, combining Equation (37) with Equation (38), we have

$$\begin{aligned} z_{t+1}&\le \delta \langle \mathbf {m}_{t}, \mathbf {x}_{t+1} \rangle + z_t(\mathbf {x}_{t+1}) - \delta \langle \mathbf {m}_{t}, \mathbf {x}_{t+1}^* \rangle \nonumber \\&= z_t(\mathbf {x}_{t+1}) + \delta \langle \mathbf {m}_{t}, \mathbf {x}_{t+1}-\mathbf {x}_{t+1}^* \rangle . \end{aligned}$$
(39)

Furthermore, applying the Cauchy-Schwarz inequality to Equation (39), we obtain

$$\begin{aligned} z_{t+1} \le z_t(\mathbf {x}_{t+1}) + \delta \Vert \mathbf {m}_{t}\Vert \Vert \mathbf {x}_{t+1}-\mathbf {x}_{t+1}^*\Vert . \end{aligned}$$
(40)

Next, according to Assumption 2 and by unrolling the recursion in Equation (12), we obtain

$$\begin{aligned} \Vert \mathbf {m}_t\Vert&= \left\| \beta _1^t\mathbf {m}_0 + (1-\beta _1)\left( \mathbf {g}_t + \beta _1\mathbf {g}_{t-1}+\ldots +\beta _1^{t-1}\mathbf {g}_1\right) \right\| \nonumber \\&\le (1-\beta _1)\left( 1+\beta _1+\ldots +\beta _1^{t-1}\right) K_{\top } \nonumber \\&\le K_{\top }. \end{aligned}$$
(41)
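As a quick numerical sanity check of Equation (41) (not part of the original proof), the following Python snippet iterates the first-order momentum recursion with bounded gradients, confirms that its magnitude never exceeds \(K_{\top }\), and compares it against the unrolled closed form; the constants beta1, K_top and T are illustrative.

```python
import numpy as np

np.random.seed(0)
beta1, K_top, T = 0.9, 1.0, 200
g = np.random.uniform(-K_top, K_top, size=T)       # bounded gradients, |g_t| <= K_top

m = 0.0
for t in range(T):
    m = beta1 * m + (1.0 - beta1) * g[t]           # recursion of Eq. (12) with m_0 = 0
    assert abs(m) <= K_top + 1e-12                 # the bound of Eq. (41)

# Unrolled closed form, as in the first line of Eq. (41): both expressions agree.
m_unrolled = (1.0 - beta1) * sum(beta1 ** k * g[T - 1 - k] for k in range(T))
print(abs(m - m_unrolled) < 1e-10)                 # True
```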

By definition, \(Y_t(\mathbf {x})\) is 2-strongly convex. Meanwhile, from the optimality of \(\mathbf {x}_t^*\) and Definition 2, for any \(\mathbf {x}\in \mathcal {F}\), we obtain

$$\begin{aligned} \Vert \mathbf {x}-\mathbf {x}_t^*\Vert ^2\le Y_t(\mathbf {x}) - Y_t(\mathbf {x}_t^*). \end{aligned}$$
(42)

Setting \(\mathbf {x}=\mathbf {x}_{t+1}\) and considering time \(t+1\), we obtain

$$\begin{aligned} \Vert \mathbf {x}_{t+1}-\mathbf {x}_{t+1}^*\Vert ^2&\le Y_t(\mathbf {x}_{t+1}) - Y_t(\mathbf {x}_{t+1}^*)\nonumber \\&=z_{t+1}. \end{aligned}$$
(43)

Therefore, combining Equations (36), (40), (41) and (43), we have

$$\begin{aligned} z_{t+1} \le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) z_t + \delta K_{\top }\sqrt{z_{t+1}}+\frac{\varpi _{\top }^2 J_{\top }^2}{t}. \end{aligned}$$
(44)

Consequently, Lemma 1 follows from the above analysis.

Proof of Lemma 2

Proof

To compare the terms \(\frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}}\right)\) and \(\frac{1}{\sqrt{t+1}}\), we directly calculate the difference of their squares as follows:

$$\begin{aligned}&\left[ \frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}}\right) \right] ^2 - \left( \frac{1}{\sqrt{t+1}}\right) ^2 \nonumber \\&= \frac{1}{t}\left( 1-\frac{1}{\sqrt{t}}+\frac{1}{4t}\right) - \frac{1}{t+1} \nonumber \\&= \frac{1}{t}-\frac{1}{t\sqrt{t}}+\frac{1}{4t^2}-\frac{1}{t+1} \nonumber \\&= \frac{t(t+1)}{t(t+1)}\left( \frac{1}{t}-\frac{1}{t\sqrt{t}}+\frac{1}{4t^2}-\frac{1}{t+1}\right) \nonumber \\&=\frac{1}{t(t+1)}\left( t+1-\frac{t+1}{\sqrt{t}}+\frac{t+1}{4t}-t\right) \nonumber \\&=\frac{1}{4t^2(t+1)}\left( 5t+1-4\sqrt{t}(t+1)\right) . \end{aligned}$$
(45)

From the fact that \(5t+1\le 4\sqrt{t}(t+1)\) for all \(t\ge 1\), we have that

$$\begin{aligned} \left[ \frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}}\right) \right] ^2 - \left( \frac{1}{\sqrt{t+1}}\right) ^2\le 0. \end{aligned}$$
(46)

Therefore, by Equation (46), we attain the result of Lemma 2:

$$\begin{aligned} \frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}} \right) \le \frac{1}{\sqrt{t+1}}. \end{aligned}$$
(47)

The proof of Lemma 2 is completed.
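The inequality of Lemma 2 can also be checked numerically; the short snippet below (an illustrative check, not part of the proof) verifies Equation (47) over a range of integer t.

```python
import math

# Numerical check of Lemma 2: (1/sqrt(t)) * (1 - 1/(2*sqrt(t))) <= 1/sqrt(t + 1).
for t in range(1, 10001):
    lhs = (1.0 / math.sqrt(t)) * (1.0 - 1.0 / (2.0 * math.sqrt(t)))
    rhs = 1.0 / math.sqrt(t + 1)
    assert lhs <= rhs, f"inequality fails at t = {t}"
print("Lemma 2 holds for t = 1, ..., 10000")
```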

Proof of Lemma 3

Proof

Since the parameters chosen by our proposed algorithm satisfy

$$t\delta K_{\top }\sqrt{z_{t+1}}\le 3\varpi _{\top }^2 J_{\top }^2-2\varpi _{\top }J_{\top }^2,$$

from Equation (24) we obtain

$$\begin{aligned} z_{t+1} \le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) z_t +\frac{4\varpi _{\top }^2 J_{\top }^2}{t}-\frac{2\varpi _{\top } J_{\top }^2}{t}. \end{aligned}$$
(48)

Next, to prove Equation (26), we use mathematical induction. First, considering the base case \(t=1\), we have

$$\begin{aligned} z_1&= Y_1(\mathbf {x}_1) - Y_1(\mathbf {x}_1^*) \nonumber \\&=\delta \langle \mathbf {m}_0,\mathbf {x}_1\rangle + \Vert \mathbf {x}_1-\mathbf {x}_1\Vert ^2 - \delta \langle \mathbf {m}_0,\mathbf {x}_1^*\rangle - \Vert \mathbf {x}_1-\mathbf {x}_1^*\Vert ^2 \nonumber \\&= - \Vert \mathbf {x}_1-\mathbf {x}_1^*\Vert ^2 \le 4\varpi _{\top }J_{\top }^2 . \end{aligned}$$
(49)

Thus, the base case \(t=1\) holds. Second, assume that Equation (26) is true at time t. We then consider time \(t+1\) as follows. From Equation (24) and the relationship

$$t\delta K_{\top }\sqrt{z_{t+1}}\le 3\varpi _{\top }^2 J_{\top }^2-2\varpi _{\top }J_{\top }^2,$$

we obtain

$$\begin{aligned} z_{t+1}&\le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) z_t + \delta K_{\top }\sqrt{z_{t+1}}+\frac{\varpi _{\top }^2 J_{\top }^2}{t} \nonumber \\&\le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) z_t+\frac{3\varpi _{\top }^2 J_{\top }^2-2\varpi _{\top }J_{\top }^2}{t}+\frac{\varpi _{\top }^2 J_{\top }^2}{t} \nonumber \\&\le \left( 1-\frac{\varpi _{\top }}{\sqrt{t}}\right) \frac{4\varpi _{\top }J_{\top }^2}{\sqrt{t}}+\frac{4\varpi _{\top }^2 J_{\top }^2}{t}-\frac{2\varpi _{\top } J_{\top }^2}{t} \nonumber \\&\le \frac{4\varpi _{\top }J_{\top }^2}{\sqrt{t}} - \frac{4\varpi _{\top }^2 J_{\top }^2}{t} +\frac{4\varpi _{\top }^2 J_{\top }^2}{t}-\frac{2\varpi _{\top } J_{\top }^2}{t} \nonumber \\&\le 4\varpi _{\top }J_{\top }^2\left( \frac{1}{\sqrt{t}} -\frac{1}{2t}\right) \nonumber \\&\le 4\varpi _{\top }J_{\top }^2\left[ \frac{1}{\sqrt{t}}\left( 1 -\frac{1}{2\sqrt{t}}\right) \right] . \end{aligned}$$
(50)

In addition, applying Lemma 2 to Equation (50), we attain

$$\begin{aligned} z_{t+1}&\le 4\varpi _{\top }J_{\top }^2\left[ \frac{1}{\sqrt{t}}\left( 1 -\frac{1}{2\sqrt{t}}\right) \right] \le \frac{4\varpi _{\top }J_{\top }^2}{\sqrt{t+1}}. \end{aligned}$$
(51)

By Equation (51), Equation (26) is true at time \(t+1\); therefore, by induction it holds for all \(t\in \{1,\ldots ,T\}\). The proof of Lemma 3 is completed.
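For intuition, the following snippet (with arbitrarily chosen \(\varpi _{\top }\) and \(J_{\top }\), labelled varpi and J) iterates the recursion of Equation (48) with equality from the worst-case base value allowed by Equation (49) and confirms that the resulting sequence never exceeds the bound \(4\varpi _{\top }J_{\top }^2/\sqrt{t}\) of Lemma 3.

```python
import math

# Iterate the recursion of Eq. (48) with equality, starting from the worst-case
# base value z_1 = 4 * varpi * J^2 permitted by Eq. (49), and check the bound
# of Lemma 3 at every step. The constants varpi and J are illustrative.
varpi, J = 0.8, 1.0
z = 4.0 * varpi * J ** 2
for t in range(1, 10001):
    assert z <= 4.0 * varpi * J ** 2 / math.sqrt(t) + 1e-12
    z = (1.0 - varpi / math.sqrt(t)) * z + (4.0 * varpi ** 2 - 2.0 * varpi) * J ** 2 / t
print("the bound of Lemma 3 holds along the recursion")
```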

Proof of Theorem 1

Proof

Denote \(\mathbf {x}^*:=\arg \min _{\mathbf {x}\in \mathcal {F}}\sum _{t=1}^T f_t(\mathbf {x})\). According to the definition of the regret \(\mathcal {R}(T)\), we have

$$\begin{aligned} \mathcal {R}(T)&= \sum _{t=1}^T f_t(\mathbf {x}_t) - \min _{\mathbf {x}\in \mathcal {F}}\sum _{t=1}^T f_t(\mathbf {x}) \nonumber \\&= \sum _{t=1}^T\big [ f_t(\mathbf {x}_t) - f_t(\mathbf {x}^*) \big ] \nonumber \\&= \sum _{t=1}^T\big [ f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*) + f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*) \big ] \nonumber \\&\le \sum _{t=1}^T \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert + \sum _{t=1}^T \big [f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*) \big ]. \end{aligned}$$
(52)

To get the bound of \(\mathcal {R}(T)\), we first consider the term

$$\sum _{t=1}^T \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert$$

in Equation (52). By Assumption 1, the function \(f_t(\mathbf {x})\) is Lipschitz with constant L for all \(t\in \{1,\ldots ,T\}\). Moreover, by Definition 3, we obtain

$$\begin{aligned} \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert \le L\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert . \end{aligned}$$
(53)

Summing Equation (53) over t, we have

$$\begin{aligned} \sum _{t=1}^T\big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert \le L\sum _{t=1}^T\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert . \end{aligned}$$
(54)

Setting \(\mathbf {x}=\mathbf {x}_t\) in Equation (42) and applying Lemma 3, we attain

$$\begin{aligned} \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert&\le \sqrt{Y_t(\mathbf {x}_t)-Y_t(\mathbf {x}_t^*)} \nonumber \\&\le \sqrt{z_t} \nonumber \\&\le \frac{2J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}}. \end{aligned}$$
(55)

Comparing the sum with the corresponding integral, we have the relationship

$$\sum _{t=1}^T \frac{1}{t^{1/4}}\le \int _{0}^T \frac{1}{t^{1/4}}dt=\frac{4}{3}T^{3/4}.$$

Therefore, combining Equations (54) and (55), we obtain

$$\begin{aligned} \sum _{t=1}^T\big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert&\le L\sum _{t=1}^T\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert \nonumber \\&\le L\sum _{t=1}^T\frac{2J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}} \nonumber \\&\le \frac{8}{3}LJ_{\top }\sqrt{\varpi _{\top }}T^{3/4}. \end{aligned}$$
(56)

Now, the bound of \(\sum _{t=1}^T \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert\) is obtained. Next, we turn to calculate the bound of the term

$$\sum _{t=1}^T \big [f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*) \big ]$$

in Equation (52). By the smoothness of \(f_t(\mathbf {x})\) and Definition 4, we have

$$\begin{aligned} f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)&=f_t\left[ \mathbf {x}_t-(\mathbf {x}_t-\mathbf {x}_t^*)\right] - f_t(\mathbf {x}^*) \nonumber \\&\le f_t(\mathbf {x}_t) - f_t(\mathbf {x}^*) - \mathbf {g}_t\odot (\mathbf {x}_t-\mathbf {x}_t^*) \nonumber \\&+\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 . \end{aligned}$$
(57)

Moreover, from the convexity of \(f_t(\mathbf {x})\), Definition 1 and the optimality of \(\mathbf {x}^*\), we further obtain

$$\begin{aligned} f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)&\le \nabla f_{t}(\mathbf {x}^*)\odot (\mathbf {x}^*-\mathbf {x}_t) +\Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 - \mathbf {g}_t\odot (\mathbf {x}_t-\mathbf {x}_t^*)\nonumber \\&\le \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 - \mathbf {g}_t\odot (\mathbf {x}_t-\mathbf {x}_t^*). \end{aligned}$$
(58)

In addition, applying the Cauchy-Schwarz inequality to Equation (58), we attain

$$\begin{aligned} f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)&\le \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 - \mathbf {g}_t\odot (\mathbf {x}_t-\mathbf {x}_t^*) \nonumber \\&\le \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert ^2 + \Vert \mathbf {g}_t\Vert \Vert \mathbf {x}_t-\mathbf {x}_t^*\Vert . \end{aligned}$$
(59)

Applying Assumptions 2 and 3, and using Equation (55), we further have

$$\begin{aligned} f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)&\le \left( \frac{2J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}}\right) ^2+ \frac{2K_{\top }J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}} \nonumber \\&\le \frac{4J_{\top }^2\varpi _{\top }}{t^{1/2}}+\frac{2K_{\top }J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}}. \end{aligned}$$
(60)

Next, summing both sides of Equation (60) over t, we obtain

$$\begin{aligned} \sum _{t=1}^{T} \big [f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)\big ]&\le \sum _{t=1}^{T}\frac{4J_{\top }^2\varpi _{\top }}{t^{1/2}}+\sum _{t=1}^{T}\frac{2K_{\top }J_{\top }\sqrt{\varpi _{\top }}}{t^{1/4}}. \end{aligned}$$
(61)

Substituting the inequalities

$$\sum _{t=1}^T \frac{1}{t^{1/2}}\le 2\sqrt{T}$$

and

$$\sum _{t=1}^T \frac{1}{t^{1/4}}\le \frac{4}{3}T^{3/4}$$

into Equation (61), we attain

$$\begin{aligned} \sum _{t=1}^{T} \big [f_t(\mathbf {x}_t^*) - f_t(\mathbf {x}^*)\big ] \le 8J_{\top }^2\varpi _{\top }T^{1/2} + \frac{8}{3}K_{\top }J_{\top }\sqrt{\varpi _{\top }}T^{3/4}. \end{aligned}$$
(62)

Finally, combining Equations (52), (56) and (62), we have that

$$\begin{aligned} \mathcal {R}(T)&\le \frac{8}{3}LJ_{\top }\sqrt{\varpi _{\top }}T^{3/4}+ \frac{8}{3}K_{\top }J_{\top }\sqrt{\varpi _{\top }}T^{3/4} + 8J_{\top }^2\varpi _{\top }T^{1/2}\nonumber \\&= \frac{8}{3}(L+K_{\top })J_{\top }\sqrt{\varpi _{\top }}T^{3/4} + 8J_{\top }^2\varpi _{\top }T^{1/2}. \end{aligned}$$
(63)

Therefore, the stated bound of the regret \(\mathcal {R}(T)\) is obtained.
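The two elementary summation bounds used above, \(\sum _{t=1}^T t^{-1/2}\le 2\sqrt{T}\) and \(\sum _{t=1}^T t^{-1/4}\le \frac{4}{3}T^{3/4}\), can be checked numerically with the illustrative snippet below.

```python
# Check the two elementary summation bounds used in the proof of Theorem 1.
for T in (10, 100, 1000, 10000):
    s_half = sum(t ** -0.5 for t in range(1, T + 1))
    s_quarter = sum(t ** -0.25 for t in range(1, T + 1))
    assert s_half <= 2.0 * T ** 0.5                  # sum t^(-1/2) <= 2 sqrt(T)
    assert s_quarter <= (4.0 / 3.0) * T ** 0.75      # sum t^(-1/4) <= (4/3) T^(3/4)
print("both summation bounds hold")
```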

Proof of Theorem 2

Proof

Following [18], we define the loss function as in Equation (30); its minimum regret is attained at \(x=0\). For Adam, we set \(\beta _1=0\), \(0<\sqrt{\beta _2}<\lambda <1\), and \(\alpha _t = \alpha / \sqrt{t}\), where \(t\in \{1,\ldots ,T\}\). Then the gradient of \(f_t(x_{t,i})\) when \(x_{t,i}\ge 0\) is as follows

$$\begin{aligned} g_{t,i} = C\lambda ^{t-1}. \end{aligned}$$
(64)

Moreover, unrolling the recursion in Equation (27), we obtain the following

$$\begin{aligned} v_{t,i}&= \beta _2 v_{t-1,i} + (1-\beta _2)\left( C\lambda ^{t-1}\right) ^2 \nonumber \\&= \sum _{\tau =1}^{t}\beta _{2}^{t-\tau }(1-\beta _2)\left( C\lambda ^{\tau -1}\right) ^2 \nonumber \\&= \frac{(1-\beta _2)\beta _2^t C^2}{\lambda ^2}\sum _{\tau =1}^{t}\left( \frac{\lambda ^2}{\beta _2}\right) ^{\tau } \nonumber \\&= \frac{(1-\beta _2)\left( \lambda ^{2t}-\beta _2^t\right) C^2}{\lambda ^2 - \beta _2}. \end{aligned}$$
(65)

Since \(\sqrt{\beta _2}<\lambda\), we have the following

$$\begin{aligned} \alpha _t\frac{m_{t,i}}{\sqrt{v_{t,i}}}&= \frac{\alpha (1-\beta _1)\left( \lambda ^t-\beta _1^t\right) }{\sqrt{t}(\lambda -\beta _1)}\frac{\sqrt{\lambda ^2-\beta _2}}{\sqrt{(1-\beta _2)\left( \lambda ^{2t}-\beta _2^t\right) }} \nonumber \\&= \frac{\alpha (1-\beta _1)\sqrt{\lambda ^2-\beta _2}}{\sqrt{t}(\lambda -\beta _1)\sqrt{(1-\beta _2)}}\frac{1-\left( \beta _1/\lambda \right) ^t}{\sqrt{1-\left( \beta _2/\lambda ^2\right) ^t}} \nonumber \\&\ge \frac{\alpha (1-\beta _1)\sqrt{\lambda ^2-\beta _2}}{\sqrt{t}(\lambda -\beta _1)\sqrt{(1-\beta _2)}}\left( 1-\frac{\beta _1}{\lambda }\right) . \end{aligned}$$
(66)

By Equations (28) and (66), we attain the following

$$\begin{aligned} x_{t+1,i}&= x_{t,i}-\alpha _t\frac{m_{t,i}}{\sqrt{v_{t,i}}} \nonumber \\&\le x_{t,i} - \frac{\alpha (1-\beta _1)\sqrt{\lambda ^2-\beta _2}}{\sqrt{t}(\lambda -\beta _1)\sqrt{(1-\beta _2)}}\left( 1-\frac{\beta _1}{\lambda }\right) \nonumber \\&\le x_{0,i} - \frac{\alpha (1-\beta _1)\sqrt{\lambda ^2-\beta _2}}{(\lambda -\beta _1)\sqrt{(1-\beta _2)}}\left( 1-\frac{\beta _1}{\lambda }\right) \sum _{\tau =1}^t\frac{1}{\sqrt{\tau }}. \end{aligned}$$
(67)

Since \(\sum _{\tau =1}^t\frac{1}{\sqrt{\tau }}\) diverges as \(t\rightarrow \infty\), Equation (67) implies that Adam always reaches the negative region when t is large enough.
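The behaviour derived above can be reproduced numerically. The snippet below (with arbitrarily chosen \(C\), \(\lambda\), \(\beta _2\) and \(\alpha\) satisfying \(\sqrt{\beta _2}<\lambda <1\)) confirms that the closed form of Equation (65) matches the recursion and that Adam's cumulative step size grows without bound, which is what drives the iterate into the negative region.

```python
import math

# Illustrative constants with sqrt(beta2) < lambda < 1, as assumed above.
C, lam, beta2, alpha = 1.0, 0.9, 0.7, 0.1
v, cum_step = 0.0, 0.0
for t in range(1, 2001):
    g = C * lam ** (t - 1)                                  # gradient of Eq. (64)
    v = beta2 * v + (1.0 - beta2) * g ** 2                  # recursion used in Eq. (65)
    v_closed = (1.0 - beta2) * (lam ** (2 * t) - beta2 ** t) * C ** 2 / (lam ** 2 - beta2)
    assert abs(v - v_closed) < 1e-12                        # closed form of Eq. (65)
    cum_step += (alpha / math.sqrt(t)) * g / math.sqrt(v)   # Adam's effective step size
print(cum_step)   # grows like sqrt(t), matching the divergence argument after Eq. (67)
```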

For LightAdam, we also set \(\beta _1=0, 0<\sqrt{\beta _2}<\lambda <1,\) and \(\alpha _t = \alpha / \sqrt{t},\) where \(t\in \{1,\ldots ,T\}\). Then, by Equations (13) and (14), we obtain the following

$$\begin{aligned} \hat{v}_{t,i}&= \frac{\sum _{\tau =1}^t\beta _2(1+\beta _2)^{t-\tau }\left( C\lambda ^{\tau -1}\right) ^2}{(1+\beta _2)^t-1} \nonumber \\&= \frac{(1+\beta _2)^t\beta _2\lambda ^{-2}C^2}{(1+\beta _2)^t-1}\sum _{\tau =1}^t\left( \frac{\lambda ^2}{1+\beta _2}\right) ^{\tau } \nonumber \\&= \frac{\beta _2 C^2}{1+\beta _2-\lambda ^2}\frac{(1+\beta _2)^t-\lambda ^{2t}}{(1+\beta _2)^t-1}. \end{aligned}$$
(68)

By Equation (68), we further have the following

$$\begin{aligned} \alpha _t\frac{m_{t,i}}{\sqrt{\hat{v}_{t,i}}}&= \frac{\alpha \lambda ^t C}{\sqrt{t}\lambda }\frac{\sqrt{1+\beta _2-\lambda ^2}\sqrt{(1+\beta _2)^t-1}}{\sqrt{\beta _2 C^2}\sqrt{(1+\beta _2)^t-\lambda ^{2t}}} \nonumber \\&=\frac{\alpha \lambda ^{t-1}\sqrt{1+\beta _2-\lambda ^2}}{\sqrt{\beta _2 t}}\frac{\sqrt{(1+\beta _2)^t-1}}{\sqrt{(1+\beta _2)^t-\lambda ^{2t}}} \nonumber \\&\le \alpha \sqrt{\frac{1+\beta _2-\lambda ^2}{\beta _2 t}}\lambda ^{t-1}. \end{aligned}$$
(69)

Moreover, from Equations (15), (18) and (69), we attain the following

$$\begin{aligned} x_{t+1,i}&= x_{t,i}-\alpha _t\frac{m_{t,i}}{\sqrt{\hat{v}_{t,i}}} \nonumber \\&\ge x_{t,i} - \alpha \sqrt{\frac{1+\beta _2-\lambda ^2}{\beta _2 t}}\lambda ^{t-1} \nonumber \\&\ge x_{0,i} - \alpha \sqrt{\frac{1+\beta _2-\lambda ^2}{\beta _2}}\sum _{\tau =1}^t\frac{\lambda ^{\tau -1}}{\sqrt{\tau }}. \end{aligned}$$
(70)

From Equation (69), the step size of LightAdam is bounded, and therefore it is not disturbed by extreme gradients. In addition, from Equation (70), the total displacement of LightAdam remains bounded, so it is able to converge to the optimal solution if its step size and parameters are initialized suitably. Therefore, the proof of Theorem 2 is completed.
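Finally, the contrast between Equations (67) and (70) can be illustrated numerically: with the same hypothetical constants as before, the per-step bound of Equation (69) decays geometrically, so LightAdam's total displacement stays bounded, whereas the corresponding lower bound for Adam keeps growing with T.

```python
import math

# Same hypothetical constants as before, with sqrt(beta2) < lambda < 1.
lam, beta2, alpha = 0.9, 0.7, 0.1
light_total, adam_total = 0.0, 0.0
for t in range(1, 10001):
    # Per-step bound of Eq. (69): geometric decay keeps the total displacement bounded.
    light_total += alpha * math.sqrt((1.0 + beta2 - lam ** 2) / (beta2 * t)) * lam ** (t - 1)
    # Per-step lower bound of Eq. (67) (up to a constant factor): diverges with T.
    adam_total += alpha / math.sqrt(t)
print(f"LightAdam total displacement bound: {light_total:.4f} (bounded)")
print(f"Adam total displacement lower bound: {adam_total:.1f} (grows like sqrt(T))")
```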


About this article


Cite this article

Zhou, Y., Huang, K., Cheng, C. et al. LightAdam: Towards a Fast and Accurate Adaptive Momentum Online Algorithm. Cogn Comput 14, 764–779 (2022). https://doi.org/10.1007/s12559-021-09985-9
