Adaptive Momentum Coefficient for Neural Network Optimization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12458)

Abstract

We propose a novel and efficient momentum-based first-order algorithm for optimizing neural networks which uses an adaptive coefficient for the momentum term. Our algorithm, called Adaptive Momentum Coefficient (AMoC), utilizes the inner product of the gradient and the previous update to the parameters to effectively control the amount of weight put on the momentum term based on the change of direction in the optimization path. The algorithm is easy to implement and its computational overhead over momentum methods is negligible. Extensive empirical results on both convex and neural network objectives show that AMoC performs well in practice and compares favourably with other first- and second-order optimization algorithms. We also provide a convergence analysis and a convergence rate for AMoC, showing theoretical guarantees similar to those provided by other efficient first-order methods.
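The update rule behind this adaptive coefficient is spelled out in Appendix A. The short Python sketch below restates it for a deterministic gradient oracle; the function and parameter names (amoc_step, grad_fn, and the default values of eps, mu, beta) are illustrative choices of ours, not taken from the authors' code.

```python
import numpy as np

def amoc_step(theta, d_prev, grad_fn, eps=0.01, mu=0.9, beta=0.1):
    """One AMoC update, following the definitions in Appendix A (illustrative sketch)."""
    g = grad_fn(theta)                                   # g_k = grad f(theta_k)

    # Inner product of the normalized gradient and the previous update, g_bar . d_bar.
    g_norm, d_norm = np.linalg.norm(g), np.linalg.norm(d_prev)
    inner = float(np.dot(g, d_prev) / (g_norm * d_norm)) if g_norm > 0 and d_norm > 0 else 0.0

    gamma = mu * (1.0 - beta * inner)                    # adaptive coefficient gamma_k
    d_new = -eps * g + gamma * d_prev                    # theta_{k+1} - theta_k
    return theta + d_new, d_new
```

In a training loop, d_prev starts at zero and is carried between calls. When the normalized gradient opposes the previous update (inner product near −1, the usual descent situation), the coefficient rises towards \(\mu (1+\beta )\); when the two align, for instance after overshooting, it is damped towards \(\mu (1-\beta )\).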


Notes

  1. https://github.com/tensorflow/kfac.

References

  1. Amari, S.I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)

  2. Aujol, J.F., Rondepierre, A., Aujol, J., Dossal, C., et al.: Optimal convergence rates for Nesterov acceleration. arXiv preprint arXiv:1805.05719 (2018)

  3. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  4. Defazio, A.: On the curved geometry of accelerated optimization. arXiv preprint arXiv:1812.04634 (2018)

  5. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(Jul), 2121–2159 (2011)

  6. Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 European Control Conference (ECC), pp. 310–315. IEEE (2015)

  7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  8. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

  9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  10. Lucas, J., Sun, S., Zemel, R., Grosse, R.: Aggregated momentum: stability through passive damping. arXiv preprint arXiv:1804.00325 (2018)

  11. Martens, J.: Deep learning via hessian-free optimization. In: ICML, vol. 27, pp. 735–742 (2010)

  12. Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-Factored approximate curvature. In: International Conference on Machine Learning, pp. 2408–2417 (2015)

  13. Meng, X., Chen, H.: Accelerating Nesterov’s method for strongly convex functions with Lipschitz gradient. arXiv preprint arXiv:1109.6058 (2011)

  14. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017)

  15. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Heidelberg (2013). https://doi.org/10.1007/978-1-4419-8853-9

  16. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)

  17. O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)

  18. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

  19. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of ADAM and beyond (2018)

  20. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems, pp. 2510–2518 (2014)

  21. Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of initialization and momentum in deep learning. In: ICML (3), vol. 28, pp. 1139–1147 (2013)

  22. Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4(2), 26–31 (2012)

  23. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)

  24. Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)

Acknowledgements

We would like to thank Ruth Urner for valuable discussions and insights. We would also like to thank the anonymous reviewers for their useful suggestions and constructive feedback.

Author information

Correspondence to Zana Rashidi.

Appendices

A Proofs

Lemma

For an arbitrarily large integer T, there exists \(\beta >0\) such that the first \(T+1\) elements of the sequence \(\{\lambda _k\}\) are positive.

Proof

Suppose that \(\lambda _1,\dots ,\lambda _k\) are positive and \(\lambda _{k+1}\) is negative. We have:

$$\begin{aligned} \begin{aligned} \lambda _{j+1}-\frac{\mu (1+\beta )}{1-\mu (1+\beta )}&=\frac{\lambda _j}{\gamma _j}-\frac{1}{1-\mu (1+\beta )}\\&\ge \frac{\lambda _j}{\mu (1+\beta )}-\frac{1}{1-\mu (1+\beta )}\\&=\frac{\lambda _j-\frac{\mu (1+\beta )}{1-\mu (1+\beta )}}{\mu (1+\beta )} \end{aligned} \end{aligned}$$

Combining these inequalities for \(j=1,\dots ,k\) yields:

$$\begin{aligned} \begin{aligned} -\frac{\mu (1+\beta )}{1-\mu (1+\beta )}&\ge \lambda _{k+1}-\frac{\mu (1+\beta )}{1-\mu (1+\beta )}\\&\ge \frac{\lambda _1-\frac{\mu (1+\beta )}{1-\mu (1+\beta )}}{(\mu (1+\beta ))^k}\\&=\frac{-2\mu \beta }{\big ((1-\mu )^2-\mu ^2\beta ^2\big )(\mu (1+\beta ))^k} \end{aligned} \end{aligned}$$

therefore:

$$\begin{aligned} k\ge \ln \Big (\frac{\mu (1+\beta )(1-\mu (1-\beta ))}{2\mu \beta }\Big )\Big /\ln \Big (\frac{1}{\mu (1+\beta )}\Big ) \end{aligned}$$

The right-hand side of this inequality approaches \(+\infty \) as \(\beta \) approaches 0. Thus, by choosing a sufficiently small \(\beta \), we obtain an arbitrarily large number of positive terms.
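As a purely illustrative sanity check of this lemma, the sketch below iterates the recursion \(\lambda _{k+1}=\lambda _k/\gamma _k-1\) that is implicit in the proof of the theorem, with the worst-case coefficient \(\gamma _k=\mu (1+\beta )\) and the starting value \(\lambda _1=\mu (1-\beta )/(1-\mu (1-\beta ))\) implied by the algebra above; both choices are our reading of the proof rather than values stated explicitly.

```python
def positive_lambda_count(mu=0.9, beta=0.01, max_iters=100_000):
    """Count how many terms of {lambda_k} stay positive under the worst-case gamma_k."""
    assert 0.0 < mu * (1.0 + beta) < 1.0, "requires mu < 1/(1+beta)"
    lam = mu * (1.0 - beta) / (1.0 - mu * (1.0 - beta))  # lambda_1 (our reading of the proof)
    count = 0
    while lam > 0.0 and count < max_iters:
        count += 1
        lam = lam / (mu * (1.0 + beta)) - 1.0            # lambda_{k+1} = lambda_k / gamma_k - 1
    return count

# Smaller beta keeps lambda_k positive for more iterations, as the lemma states.
for beta in (0.1, 0.01, 0.001):
    print(beta, positive_lambda_count(mu=0.9, beta=beta))
```

With \(\mu =0.9\), the count grows roughly logarithmically in \(1/\beta \), consistent with the lower bound on k derived above.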

Theorem

For any differentiable convex function f with L-Lipschitz gradients, the sequence generated by AMoC with sufficiently small \(\beta \), \(\mu \in [0,\frac{1}{1+\beta })\), and \(\epsilon \in (0,\frac{\mu (1+\beta )}{L\lambda _1})\) satisfies the following:

$$\begin{aligned} f(\tilde{\theta }_T)-f(\theta ^{*})\le \frac{\Vert \theta _{1}-\theta ^{*}\Vert ^{2}}{2T(1+\lambda _{T+1})}\Big (\frac{1}{\epsilon }+\frac{\lambda _1^2L}{\mu }\Big ) \end{aligned}$$
(13)

where \(\theta ^*\) is the optimal point and \(\tilde{\theta }_T=(\sum _{k=1}^T\frac{\lambda _k}{\gamma _k}\theta _k)/(\sum _{k=1}^T\frac{\lambda _k}{\gamma _k})\).

Proof

To prove this theorem, we follow an approach similar to that of [6]. By definition we have:

$$\begin{aligned} \begin{aligned} d_{k}&=\theta _{k}-\theta _{k-1}\\ g_{k}&=\nabla f\left( \theta _{k}\right) \\ \gamma _{k}&=\mu \left( 1-\beta \bar{g}_{k} \cdot \bar{d}_{k} \right) \\ \theta _{k+1}&=\theta _{k}- \epsilon g_{k}+\gamma _{k} d_{k} \end{aligned} \end{aligned}$$

therefore:

$$\begin{aligned} \begin{aligned} \theta _{k+1}+\lambda _{k+1} d_{k+1}&=\left( \lambda _{k+1}+1\right) \theta _{k+1}-\lambda _{k+1} \theta _{k}\\&=\theta _k-\frac{\epsilon \lambda _k}{\gamma _k}g_k+\lambda _k d_k \end{aligned} \end{aligned}$$

By subtracting \(\theta ^*\) and setting \(\delta _k=\theta _k-\theta ^*\), we get:

$$\begin{aligned} \begin{aligned} \Vert \delta _{k+1}+\lambda _{k+1} d_{k+1}\Vert ^{2}=&\left\| \delta _{k}+\lambda _{k} d_{k}\right\| ^{2}+\left( \frac{\epsilon \lambda _{k}}{\gamma _{k}}\right) ^{2} \left\| g_{k}\right\| ^{2}\\&-\frac{2 \epsilon \lambda _{k}}{\gamma _{k}} \delta _{k} \cdot g_{k}-\frac{2 \epsilon \lambda _{k}^{2}}{\gamma _{k}} g_{k} \cdot d_{k} \end{aligned} \end{aligned}$$
(14)

According to [15, Theorem 2.1.5], \(\delta _k\cdot g_k\ge f(\theta _k)-f(\theta ^*)+\frac{1}{2L}\Vert g_k\Vert ^2\), which, combined with (14), gives:

$$\begin{aligned} \begin{aligned} \Vert \delta _{k+1}+\lambda _{k+1} d_{k+1}\Vert ^{2}\le \,&\left\| \delta _{k}+\lambda _{k} d_{k}\right\| ^{2}+\left( \frac{\epsilon \lambda _{k}}{\gamma _{k}}\right) ^{2} \left\| g_{k}\right\| ^{2}\\&-\frac{2 \epsilon \lambda _{k}}{\gamma _{k}}\left( f(\theta _k)-f(\theta ^*)+\frac{1}{2L}\Vert g_k\Vert ^2 \right) \\&-\frac{2 \epsilon \lambda _{k}^{2}}{\gamma _{k}} g_{k} \cdot d_{k}\\ \le \,&\left\| \delta _{k}+\lambda _{k} d_{k}\right\| ^{2}-\frac{2 \epsilon \lambda _{k}}{\gamma _{k}}(f(\theta _k)-f(\theta ^*))\\&-\frac{2 \epsilon \lambda _{k}^{2}}{\gamma _{k}} g_{k} \cdot d_{k} \end{aligned} \end{aligned}$$

Summing up all the inequalities for \(k=1,...,T\) yields:

$$\begin{aligned} \begin{aligned} 0\le \Vert \delta _{T+1}+\lambda _{T+1} d_{T+1}\Vert ^{2}\le \,&\left\| \delta _{1}\right\| ^{2}-2 \epsilon \sum _{k=1}^T\frac{ \lambda _{k}}{\gamma _{k}}\left( f(\theta _k)-f(\theta ^*)\right) \\&-2 \epsilon \sum _{k=1}^T\frac{\lambda _{k}^{2}}{\gamma _{k}} g_{k} \cdot d_{k}\\ \le \,&\left\| \delta _{1}\right\| ^{2}-2\epsilon \Big (\sum _{k=1}^T \frac{ \lambda _{k}}{\gamma _{k}}\Big )\left( f(\tilde{\theta }_T)-f(\theta ^*)\right) \\&-\frac{2 \epsilon }{\mu } \sum _{k=1}^{T} \lambda _{k}^2\left\| g_{k}\right\| \left\| d_{k}\right\| \frac{ \bar{g}_{k} \cdot \bar{d}_{k}}{\left( 1-\beta \bar{g}_{k} \cdot \bar{d}_{k} \right) } \end{aligned} \end{aligned}$$

For \(\beta <1\), the function \(\frac{x}{1-\beta x}\) is convex on \([-1,1]\); thus:

$$\begin{aligned} \begin{aligned} 0\le \,&\left\| \delta _{1}\right\| ^{2}-2\epsilon \Big (\sum _{k=1}^T \frac{ \lambda _{k}}{\gamma _{k}}\Big )\left( f(\tilde{\theta }_T)-f(\theta ^*)\right) \\&-\frac{2 \epsilon }{\mu } \frac{ \sum _{k=1}^{T} \lambda _{k}^2 g_{k}\cdot d_{k}}{1-\beta \frac{\sum _{k=1}^{T} \lambda _{k}^2 g_{k}\cdot d_{k} }{\sum _{k=1}^{T} \lambda _{k}^2\left\| g_{k}\right\| \left\| d_{k}\right\| } } \end{aligned} \end{aligned}$$

The function \(\frac{x}{1-\beta x}\) is also increasing, and \(g_k\cdot d_k\ge f(\theta _k)-f(\theta _{k-1})\) [15, Theorem 2.1.5]; therefore:

$$\begin{aligned} \begin{aligned} 2\epsilon \Big (\sum _{k=1}^T (1+\lambda _{k+1})\Big )\left( f(\tilde{\theta }_T)-f(\theta ^*)\right) \le \left\| \delta _{1}\right\| ^{2}-\frac{2 \epsilon }{\mu } \frac{ \sum _{k=2}^{T} \lambda _{k}^2 \left( f(\theta _k)-f(\theta _{k-1})\right) }{1-\beta \frac{\sum _{k=2}^{T} \lambda _{k}^2 \left( f(\theta _k)-f(\theta _{k-1})\right) }{\sum _{k=1}^{T} \lambda _{k}^2\left\| g_{k}\right\| \left\| d_{k}\right\| } } \end{aligned} \end{aligned}$$

Furthermore, one can easily show that the sequence \(\{\lambda _k\}\) is decreasing, and

$$\sum _{k=2}^{T} \lambda _{k}^2 \left( f(\theta _k)-f(\theta _{k-1})\right) \ge -\lambda _1^2\left( f(\theta _1)-f(\theta ^*)\right) $$

Therefore:

$$\begin{aligned} \begin{aligned} 2\epsilon T (1+\lambda _{T+1})\left( f(\tilde{\theta }_T)-f(\theta ^*)\right) \le \,&\left\| \delta _{1}\right\| ^{2}+\frac{2 \epsilon }{\mu } \frac{ \lambda _1^2\left( f(\theta _1)-f(\theta ^*)\right) }{1+\beta \frac{\lambda _1^2\left( f(\theta _1)-f(\theta ^*)\right) }{\sum _{k=1}^{T} \lambda _{k}^2\left\| g_{k}\right\| \left\| d_{k}\right\| } }\\ \le \,&\left\| \delta _{1}\right\| ^{2}+\frac{2 \epsilon }{\mu }\lambda _1^2\left( f(\theta _1)-f(\theta ^*)\right) \\ \le \,&\left\| \delta _{1}\right\| ^{2}\Big (1+\frac{\epsilon }{\mu }\lambda _1^2L\Big ) \end{aligned} \end{aligned}$$

where the final inequality follows from \(f(\theta _1)-f(\theta ^*)\le \frac{L}{2}\Vert \delta _1\Vert ^2\) [15, Theorem 2.1.5]; this concludes the proof.

Fig. 5. Inner product, \(\bar{g}_t\cdot \bar{d}_t=\cos {(\pi -\phi _t)}\), during optimization for both AMoC and AMoC-N for the convex experiments in the paper.

Fig. 6. Inner product, \(\bar{g}_t\cdot \bar{d}_t=\cos {(\pi -\phi _t)}\), during training for both AMoC and AMoC-N for the NN experiments in the paper.

B Inner Product Analysis

In this section we report the value of the inner product (\(\bar{g}_t\cdot \bar{d}_t\)) for AMoC and AMoC-N per iteration/epoch for each of the experiments, in Figs. 5 and 6. The results from the Anisotropic Bowl and the Smooth-BPDN experiments are particularly interesting. In the first case (Fig. 5a), with the inner product oscillating between positive and negative values, we can infer that the algorithm crosses the optimum multiple times (without overshooting) but is able to bounce back and eventually reach the optimum in fewer iterations than the baselines. The second case (Fig. 5c) behaves similarly, except that the algorithm seems to approach the optimum, move away, and bounce back several times (most likely along an oval-shaped trajectory) until it gets close enough to terminate. In the Ridge Regression problem (Fig. 5b), the inner product drops from values between 0.5 and 1 to values between −0.5 and −1, indicating that the algorithm moves towards the optimum steadily, with the updates always keeping a low angle with the gradient. In the neural network experiments (Figs. 6a to 6d), the inner product approaches 0 by the end of training. We link this behaviour to the algorithm moving in a circular fashion around the optima, with the direction of the negative gradient almost perpendicular (\(\phi =\pi /2\)) to the previous update.
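For readers who want to reproduce this kind of curve, the sketch below runs the AMoC update on a toy anisotropic quadratic and records \(\bar{g}_t\cdot \bar{d}_t\) at every step. The quadratic is only a stand-in for the Anisotropic Bowl objective, and all names (log_inner_products, the step-size and momentum defaults) are illustrative choices of ours, not the authors' settings.

```python
import numpy as np

def log_inner_products(grad_fn, theta0, eps=0.01, mu=0.9, beta=0.1, steps=200):
    """Run AMoC on grad_fn from theta0 and record g_bar . d_bar at every step."""
    theta = np.asarray(theta0, dtype=float)
    d = np.zeros_like(theta)
    history = []
    for _ in range(steps):
        g = grad_fn(theta)
        gn, dn = np.linalg.norm(g), np.linalg.norm(d)
        inner = float(np.dot(g, d) / (gn * dn)) if gn > 0 and dn > 0 else 0.0
        history.append(inner)                    # the quantity plotted in Figs. 5 and 6
        gamma = mu * (1.0 - beta * inner)        # adaptive momentum coefficient
        d = -eps * g + gamma * d                 # AMoC update direction
        theta = theta + d
    return history

# Example on a simple ill-conditioned quadratic, 0.5 * theta^T A theta.
A = np.diag([1.0, 100.0])
history = log_inner_products(lambda th: A @ th, theta0=[5.0, 5.0])
print(history[:5], history[-5:])
```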

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Rashidi, Z., Ahmadi K. A., K., An, A., Wang, X. (2021). Adaptive Momentum Coefficient for Neural Network Optimization. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol. 12458. Springer, Cham. https://doi.org/10.1007/978-3-030-67661-2_3

  • DOI: https://doi.org/10.1007/978-3-030-67661-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67660-5

  • Online ISBN: 978-3-030-67661-2
