Adaptive Momentum Coefficient for Neural Network Optimization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12458)

Abstract

We propose a novel and efficient momentum-based first-order algorithm for optimizing neural networks which uses an adaptive coefficient for the momentum term. Our algorithm, called Adaptive Momentum Coefficient (AMoC), utilizes the inner product of the gradient and the previous update to the parameters to effectively control the amount of weight put on the momentum term based on the change of direction in the optimization path. The algorithm is easy to implement and its computational overhead over momentum methods is negligible. Extensive empirical results on both convex and neural network objectives show that AMoC performs well in practice and compares favourably with other first- and second-order optimization algorithms. We also provide a convergence analysis and a convergence rate for AMoC, showing theoretical guarantees similar to those provided by other efficient first-order methods.
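The update rule behind this adaptive coefficient is spelled out in Appendix A. The short Python sketch below restates it for a deterministic gradient oracle; the function and parameter names (amoc_step, grad_fn, and the default values of eps, mu, beta) are illustrative choices of ours, not taken from the authors' code.

```python
import numpy as np

def amoc_step(theta, d_prev, grad_fn, eps=0.01, mu=0.9, beta=0.1):
    """One AMoC update, following the definitions in Appendix A (illustrative sketch)."""
    g = grad_fn(theta)                                   # g_k = grad f(theta_k)

    # Inner product of the normalized gradient and the previous update, g_bar . d_bar.
    g_norm, d_norm = np.linalg.norm(g), np.linalg.norm(d_prev)
    inner = float(np.dot(g, d_prev) / (g_norm * d_norm)) if g_norm > 0 and d_norm > 0 else 0.0

    gamma = mu * (1.0 - beta * inner)                    # adaptive coefficient gamma_k
    d_new = -eps * g + gamma * d_prev                    # theta_{k+1} - theta_k
    return theta + d_new, d_new
```

In a training loop, d_prev starts at zero and is carried between calls. When the normalized gradient opposes the previous update (inner product near −1, the usual descent situation), the coefficient rises towards \(\mu (1+\beta )\); when the two align, for instance after overshooting, it is damped towards \(\mu (1-\beta )\).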


Notes

  1. https://github.com/tensorflow/kfac.

References

  1. Amari, S.I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)

  2. Aujol, J.F., Rondepierre, A., Aujol, J., Dossal, C., et al.: Optimal convergence rates for Nesterov acceleration. arXiv preprint arXiv:1805.05719 (2018)

  3. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  4. Defazio, A.: On the curved geometry of accelerated optimization. arXiv preprint arXiv:1812.04634 (2018)

  5. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(Jul), 2121–2159 (2011)

  6. Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 European Control Conference (ECC), pp. 310–315. IEEE (2015)

  7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  8. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

  9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  10. Lucas, J., Sun, S., Zemel, R., Grosse, R.: Aggregated momentum: stability through passive damping. arXiv preprint arXiv:1804.00325 (2018)

  11. Martens, J.: Deep learning via hessian-free optimization. In: ICML, vol. 27, pp. 735–742 (2010)

  12. Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-Factored approximate curvature. In: International Conference on Machine Learning, pp. 2408–2417 (2015)

  13. Meng, X., Chen, H.: Accelerating Nesterov’s method for strongly convex functions with Lipschitz gradient. arXiv preprint arXiv:1109.6058 (2011)

  14. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017)

  15. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Heidelberg (2013). https://doi.org/10.1007/978-1-4419-8853-9

  16. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)

  17. O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)

  18. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  19. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of ADAM and beyond (2018)

  20. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems, pp. 2510–2518 (2014)

  21. Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of initialization and momentum in deep learning. In: ICML (3), vol. 28, pp. 1139–1147 (2013)

  22. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4(2), 26–31 (2012)

  23. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)

  24. Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)

Acknowledgements

We would like to thank Ruth Urner for valuable discussions and insights. We would also like to thank the anonymous reviewers for their useful suggestions and constructive feedback.

Author information

Correspondence to Zana Rashidi.

Appendices

A Proofs

Lemma

For an arbitrarily large integer T, there exists \(\beta >0\) such that the first \(T+1\) elements of the sequence \(\{\lambda _k\}\) are positive.

Proof

Suppose that \(\lambda _1,\dots ,\lambda _k\) are positive and \(\lambda _{k+1}\) is negative. We have:

$$\begin{aligned} \begin{aligned} \lambda _{j+1}-\frac{\mu (1+\beta )}{1-\mu (1+\beta )}&=\frac{\lambda _j}{\gamma _j}-\frac{1}{1-\mu (1+\beta )}\\&\ge \frac{\lambda _j}{\mu (1+\beta )}-\frac{1}{1-\mu (1+\beta )}\\&=\frac{\lambda _j-\frac{\mu (1+\beta )}{1-\mu (1+\beta )}}{\mu (1+\beta )} \end{aligned} \end{aligned}$$

Combining these inequalities for \(j=1,\dots ,k\) yields:

$$\begin{aligned} \begin{aligned} -\frac{\mu (1+\beta )}{1-\mu (1+\beta )}&\ge \lambda _{k+1}-\frac{\mu (1+\beta )}{1-\mu (1+\beta )}\\&\ge \frac{\lambda _1-\frac{\mu (1+\beta )}{1-\mu (1+\beta )}}{(\mu (1+\beta ))^k}\\&=\frac{-2\mu \beta }{\big ((1-\mu )^2-\mu ^2\beta ^2\big )(\mu (1+\beta ))^k} \end{aligned} \end{aligned}$$

therefore:

$$\begin{aligned} k\ge \ln \Big (\frac{\mu (1+\beta )(1-\mu (1-\beta ))}{2\mu \beta }\Big )\Big /\ln \Big (\frac{1}{\mu (1+\beta )}\Big ) \end{aligned}$$

The right-hand side of this inequality approaches \(+\infty \) as \(\beta \) approaches 0. Thus, by choosing a sufficiently small \(\beta \), we obtain an arbitrarily large number of positive terms.
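As a purely illustrative sanity check of this lemma, the sketch below iterates the recursion \(\lambda _{k+1}=\lambda _k/\gamma _k-1\) that is implicit in the proof of the theorem, with the worst-case coefficient \(\gamma _k=\mu (1+\beta )\) and the starting value \(\lambda _1=\mu (1-\beta )/(1-\mu (1-\beta ))\) implied by the algebra above; both choices are our reading of the proof rather than values stated explicitly.

```python
def positive_lambda_count(mu=0.9, beta=0.01, max_iters=100_000):
    """Count how many terms of {lambda_k} stay positive under the worst-case gamma_k."""
    assert 0.0 < mu * (1.0 + beta) < 1.0, "requires mu < 1/(1+beta)"
    lam = mu * (1.0 - beta) / (1.0 - mu * (1.0 - beta))  # lambda_1 (our reading of the proof)
    count = 0
    while lam > 0.0 and count < max_iters:
        count += 1
        lam = lam / (mu * (1.0 + beta)) - 1.0            # lambda_{k+1} = lambda_k / gamma_k - 1
    return count

# Smaller beta keeps lambda_k positive for more iterations, as the lemma states.
for beta in (0.1, 0.01, 0.001):
    print(beta, positive_lambda_count(mu=0.9, beta=beta))
```

With \(\mu =0.9\), the count grows roughly logarithmically in \(1/\beta \), consistent with the lower bound on k derived above.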

Theorem

For any differentiable convex function f with L-Lipschitz gradients, the sequence generated by AMoC with sufficiently small \(\beta \), \(\mu \in [0,\frac{1}{1+\beta })\), and \(\epsilon \in (0,\frac{\mu (1+\beta )}{L\lambda _1})\) satisfies the following:

$$\begin{aligned} f(\tilde{\theta }_T)-f(\theta ^{*})\le \frac{\Vert \theta _{1}-\theta ^{*}\Vert ^{2}}{2T(1+\lambda _{T+1})}\Big (\frac{1}{\epsilon }+\frac{\lambda _1^2L}{\mu }\Big ) \end{aligned}$$
(13)

where \(\theta ^*\) is the optimal point and \(\tilde{\theta }_T=(\sum _{k=1}^T\frac{\lambda _k}{\gamma _k}\theta _k)/(\sum _{k=1}^T\frac{\lambda _k}{\gamma _k})\).

Proof

To prove this theorem, we follow an approach similar to that of [6]. By definition we have:

$$\begin{aligned} \begin{aligned} d_{k}&=\theta _{k}-\theta _{k-1}\\ g_{k}&=\nabla f\left( \theta _{k}\right) \\ \gamma _{k}&=\mu \left( 1-\beta \bar{g}_{k} \cdot \bar{d}_{k} \right) \\ \theta _{k+1}&=\theta _{k}- \epsilon g_{k}+\gamma _{k} d_{k} \end{aligned} \end{aligned}$$

therefore:

$$\begin{aligned} \begin{aligned} \theta _{k+1}+\lambda _{k+1} d_{k+1}&=\left( \lambda _{k+1}+1\right) \theta _{k+1}-\lambda _{k+1} \theta _{k}\\&=\theta _k-\frac{\epsilon \lambda _k}{\gamma _k}g_k+\lambda _k d_k \end{aligned} \end{aligned}$$

By subtracting \(\theta ^*\) and setting \(\delta _k=\theta _k-\theta ^*\), we get:

$$\begin{aligned} \begin{aligned} \Vert \delta _{k+1}+\lambda _{k+1} d_{k+1}\Vert ^{2}=&\left\| \delta _{k}+\lambda _{k} d_{k}\right\| ^{2}+\left( \frac{\epsilon \lambda _{k}}{\gamma _{k}}\right) ^{2} \left\| g_{k}\right\| ^{2}\\&-\frac{2 \epsilon \lambda _{k}}{\gamma _{k}} \delta _{k} \cdot g_{k}-\frac{2 \epsilon \lambda _{k}^{2}}{\gamma _{k}} g_{k} \cdot d_{k} \end{aligned} \end{aligned}$$
(14)

According to [15, Theorem 2.1.5], \(\delta _k\cdot g_k\ge f(\theta _k)-f(\theta ^*)+\frac{1}{2L}\Vert g_k\Vert ^2\), which, combined with (14), gives:

$$\begin{aligned} \begin{aligned} \Vert \delta _{k+1}+\lambda _{k+1} d_{k+1}\Vert ^{2}\le \,&\left\| \delta _{k}+\lambda _{k} d_{k}\right\| ^{2}+\left( \frac{\epsilon \lambda _{k}}{\gamma _{k}}\right) ^{2} \left\| g_{k}\right\| ^{2}\\&-\frac{2 \epsilon \lambda _{k}}{\gamma _{k}}\left( f(\theta _k)-f(\theta ^*)+\frac{1}{2L}\Vert g_k\Vert ^2 \right) \\&-\frac{2 \epsilon \lambda _{k}^{2}}{\gamma _{k}} g_{k} \cdot d_{k}\\ \le \,&\left\| \delta _{k}+\lambda _{k} d_{k}\right\| ^{2}-\frac{2 \epsilon \lambda _{k}}{\gamma _{k}}(f(\theta _k)-f(\theta ^*))\\&-\frac{2 \epsilon \lambda _{k}^{2}}{\gamma _{k}} g_{k} \cdot d_{k} \end{aligned} \end{aligned}$$

Summing up all the inequalities for \(k=1,...,T\) yields:

$$\begin{aligned} \begin{aligned} 0\le \Vert \delta _{T+1}+\lambda _{T+1} d_{T+1}\Vert ^{2}\le \,&\left\| \delta _{1}\right\| ^{2}-2 \epsilon \sum _{k=1}^T\frac{ \lambda _{k}}{\gamma _{k}}\left( f(\theta _k)-f(\theta ^*)\right) \\&-2 \epsilon \sum _{k=1}^T\frac{\lambda _{k}^{2}}{\gamma _{k}} g_{k} \cdot d_{k}\\ \le \,&\left\| \delta _{1}\right\| ^{2}-2\epsilon \Big (\sum _{k=1}^T \frac{ \lambda _{k}}{\gamma _{k}}\Big )\left( f(\tilde{\theta }_T)-f(\theta ^*)\right) \\&-\frac{2 \epsilon }{\mu } \sum _{k=1}^{T} \lambda _{k}^2\left\| g_{k}\right\| \left\| d_{k}\right\| \frac{ \bar{g}_{k} \cdot \bar{d}_{k}}{\left( 1-\beta \bar{g}_{k} \cdot \bar{d}_{k} \right) } \end{aligned} \end{aligned}$$

For \(\beta <1\), the function \(\frac{x}{1-\beta x}\) is convex on \([-1,1]\); thus:

$$\begin{aligned} \begin{aligned} 0\le \,&\left\| \delta _{1}\right\| ^{2}-2\epsilon \Big (\sum _{k=1}^T \frac{ \lambda _{k}}{\gamma _{k}}\Big )\left( f(\tilde{\theta }_T)-f(\theta ^*)\right) \\&-\frac{2 \epsilon }{\mu } \frac{ \sum _{k=1}^{T} \lambda _{k}^2 g_{k}\cdot d_{k}}{1-\beta \frac{\sum _{k=1}^{T} \lambda _{k}^2 g_{k}\cdot d_{k} }{\sum _{k=1}^{T} \lambda _{k}^2\left\| g_{k}\right\| \left\| d_{k}\right\| } } \end{aligned} \end{aligned}$$

The function \(\frac{x}{1-\beta x}\) is also increasing, and \(g_k\cdot d_k\ge f(\theta _k)-f(\theta _{k-1})\) [15, Theorem 2.1.5]; therefore:

$$\begin{aligned} \begin{aligned} 2\epsilon \Big (\sum _{k=1}^T (1+\lambda _{k+1})\Big )\left( f(\tilde{\theta }_T)-f(\theta ^*)\right) \le \left\| \delta _{1}\right\| ^{2}-\frac{2 \epsilon }{\mu } \frac{ \sum _{k=2}^{T} \lambda _{k}^2 \left( f(\theta _k)-f(\theta _{k-1})\right) }{1-\beta \frac{\sum _{k=2}^{T} \lambda _{k}^2 \left( f(\theta _k)-f(\theta _{k-1})\right) }{\sum _{k=1}^{T} \lambda _{k}^2\left\| g_{k}\right\| \left\| d_{k}\right\| } } \end{aligned} \end{aligned}$$

Furthermore, one can easily show that the sequence \(\{\lambda _k\}\) is decreasing, and

$$\sum _{k=2}^{T} \lambda _{k}^2 \left( f(\theta _k)-f(\theta _{k-1})\right) \ge -\lambda _1^2\left( f(\theta _1)-f(\theta ^*)\right) $$

Therefore:

$$\begin{aligned} \begin{aligned} 2\epsilon T (1+\lambda _{T+1})\left( f(\tilde{\theta }_T)-f(\theta ^*)\right) \le \,&\left\| \delta _{1}\right\| ^{2}+\frac{2 \epsilon }{\mu } \frac{ \lambda _1^2\left( f(\theta _1)-f(\theta ^*)\right) }{1+\beta \frac{\lambda _1^2\left( f(\theta _1)-f(\theta ^*)\right) }{\sum _{k=1}^{T} \lambda _{k}^2\left\| g_{k}\right\| \left\| d_{k}\right\| } }\\ \le \,&\left\| \delta _{1}\right\| ^{2}+\frac{2 \epsilon }{\mu }\lambda _1^2\left( f(\theta _1)-f(\theta ^*)\right) \\ \le \,&\left\| \delta _{1}\right\| ^{2}\Big (1+\frac{\epsilon }{\mu }\lambda _1^2L\Big ) \end{aligned} \end{aligned}$$

where the final inequality follows from \(f(\theta _1)-f(\theta ^*)\le \frac{L}{2}\Vert \delta _1\Vert ^2\) [15, Theorem 2.1.5]; this concludes the proof.

Fig. 5. Inner product, \(\bar{g}_t\cdot \bar{d}_t=\cos {(\pi -\phi _t)}\), during optimization for both AMoC and AMoC-N for the convex experiments in the paper.

Fig. 6. Inner product, \(\bar{g}_t\cdot \bar{d}_t=\cos {(\pi -\phi _t)}\), during training for both AMoC and AMoC-N for the NN experiments in the paper.

B Inner Product Analysis

In this section we report the value of the inner product (\(\bar{g}_t\cdot \bar{d}_t\)) for AMoC and AMoC-N per iteration/epoch for each of the experiments, in Figs. 5 and 6. The results from the Anisotropic Bowl and the Smooth-BPDN experiments are particularly interesting. In the first case (Fig. 5a), with the inner product oscillating between positive and negative values, we can infer that the algorithm crosses the optimum multiple times (without overshooting) but is able to bounce back and eventually reach the optimum in fewer iterations than the baselines. The second case (Fig. 5c) behaves similarly, except that the algorithm seems to approach the optimum, move away, and bounce back several times (most likely along an oval-shaped trajectory) until it gets close enough to terminate. In the Ridge Regression problem (Fig. 5b), the inner product drops from values between 0.5 and 1 to values between −0.5 and −1, indicating that the algorithm moves towards the optimum steadily, with the updates always keeping a low angle with the gradient. In the neural network experiments (Figs. 6a to 6d), the inner product approaches 0 by the end of training. We link this behaviour to the algorithm moving in a circular fashion around the optima, with the direction of the negative gradient almost perpendicular (\(\phi =\pi /2\)) to the previous update.
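For readers who want to reproduce this kind of curve, the sketch below runs the AMoC update on a toy anisotropic quadratic and records \(\bar{g}_t\cdot \bar{d}_t\) at every step. The quadratic is only a stand-in for the Anisotropic Bowl objective, and all names (log_inner_products, the step-size and momentum defaults) are illustrative choices of ours, not the authors' settings.

```python
import numpy as np

def log_inner_products(grad_fn, theta0, eps=0.01, mu=0.9, beta=0.1, steps=200):
    """Run AMoC on grad_fn from theta0 and record g_bar . d_bar at every step."""
    theta = np.asarray(theta0, dtype=float)
    d = np.zeros_like(theta)
    history = []
    for _ in range(steps):
        g = grad_fn(theta)
        gn, dn = np.linalg.norm(g), np.linalg.norm(d)
        inner = float(np.dot(g, d) / (gn * dn)) if gn > 0 and dn > 0 else 0.0
        history.append(inner)                    # the quantity plotted in Figs. 5 and 6
        gamma = mu * (1.0 - beta * inner)        # adaptive momentum coefficient
        d = -eps * g + gamma * d                 # AMoC update direction
        theta = theta + d
    return history

# Example on a simple ill-conditioned quadratic, 0.5 * theta^T A theta.
A = np.diag([1.0, 100.0])
history = log_inner_products(lambda th: A @ th, theta0=[5.0, 5.0])
print(history[:5], history[-5:])
```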

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Rashidi, Z., Ahmadi K. A., K., An, A., Wang, X. (2021). Adaptive Momentum Coefficient for Neural Network Optimization. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol. 12458. Springer, Cham. https://doi.org/10.1007/978-3-030-67661-2_3

  • DOI: https://doi.org/10.1007/978-3-030-67661-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67660-5

  • Online ISBN: 978-3-030-67661-2
