Abstract
In this paper, we develop an efficient sketch-based empirical natural gradient method (SENG) for large-scale deep learning problems. The empirical Fisher information matrix is usually low-rank, since only a small amount of data can be sampled at each iteration. Although the corresponding natural gradient direction lies in a small subspace, both the computational cost and the memory requirement remain intractable due to the high dimensionality. We design randomized techniques for different neural network structures to resolve these challenges. For layers of moderate dimension, sketching can be performed on a regularized least-squares subproblem. Otherwise, since the gradient is a vectorization of the product of two matrices, we apply sketching to low-rank approximations of these matrices to compute the most expensive parts. A distributed version of SENG is also developed for extremely large-scale applications. Global convergence to stationary points is established under mild assumptions, and fast linear convergence is analyzed in the neural tangent kernel (NTK) regime. Extensive experiments on convolutional neural networks show the competitiveness of SENG compared with state-of-the-art methods. On ResNet50 with ImageNet-1k, SENG achieves 75.9% Top-1 testing accuracy within 41 epochs. Experiments on distributed large-batch training of ResNet50 with ImageNet-1k show that the scaling efficiency is quite reasonable.


Data Availability
Enquiries about data availability should be directed to the authors.
Notes
When \(\kappa \) is large enough, the approximation (3.10) can be obtained by computing a partial SVD of \({{\hat{G}}}_i\) or \({{\hat{A}}}_i\), depending on their sizes.
For simplicity, we consider a one-dimensional output and fix the second layer. However, the analysis extends readily to multi-dimensional outputs and to the case of jointly training both layers.
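The first note above relies on a truncated (partial) SVD to build a low-rank approximation. A minimal illustration of that primitive, using a full SVD from NumPy in place of a specialized partial-SVD routine (the matrix sizes and rank here are arbitrary, for demonstration only):

```python
import numpy as np

rng = np.random.default_rng(1)
# A tall matrix with true rank at most 30
G = rng.standard_normal((500, 30)) @ rng.standard_normal((30, 80))

k = 10  # target rank of the partial SVD
U, s, Vt = np.linalg.svd(G, full_matrices=False)
G_k = (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation (Eckart-Young)

# In the spectral norm, the approximation error equals the (k+1)-th singular value
err = np.linalg.norm(G - G_k, 2)
print(np.isclose(err, s[k]))  # True
```

In practice, when \(k\) is small relative to the matrix dimensions, one would compute only the leading \(k\) singular triplets (e.g. with a randomized or Lanczos-based partial SVD) on whichever factor is smaller, rather than a full SVD as above.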
Acknowledgements
The authors are grateful to the AE and two anonymous referees for their valuable comments and suggestions.
Funding
M. Yang, D. Xu and Z. Wen are supported in part by the Key-Area Research and Development Program of Guangdong Province (No. 2019B121204008), NSFC grant 11831002, and the Beijing Academy of Artificial Intelligence.
Ethics declarations
Competing interests
The authors have not disclosed any competing interests.
Cite this article
Yang, M., Xu, D., Wen, Z. et al. Sketch-Based Empirical Natural Gradient Methods for Deep Learning. J Sci Comput 92, 94 (2022). https://doi.org/10.1007/s10915-022-01911-x