DOI: 10.1145/3511808.3557307

Efficient Second-Order Optimization for Neural Networks with Kernel Machines

Published: 17 October 2022

Abstract

Second-order optimization has recently been explored for neural network training. However, recomputing the Hessian matrix during second-order optimization imposes a heavy extra computation and memory burden on training. Some attempts address this issue by approximating the Hessian matrix, which unfortunately degrades the performance of the neural models. To tackle this issue, we propose Kernel Stochastic Gradient Descent (Kernel SGD), which solves the optimization problem in a space transformed by the Hessian matrix of the kernel machine. Kernel SGD eliminates Hessian recomputation during training and requires a much smaller memory cost, which can be controlled via the mini-batch size. We show that Kernel SGD is theoretically guaranteed to converge. Our experimental results on tabular, image, and text data confirm that Kernel SGD converges up to 30 times faster than existing second-order optimization techniques and achieves the highest test accuracy on all tasks tested. Kernel SGD even outperforms first-order optimization baselines on some of the problems in our experiments.
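To make the idea concrete, the sketch below illustrates second-order mini-batch updates for a kernel machine with squared loss. For this loss, the Hessian block associated with a mini-batch is just that batch's Gram matrix, which depends only on the data and not on the coefficients, so it never has to be recomputed as training progresses, and its size is controlled by the mini-batch size. This is a minimal illustrative sketch, not the paper's actual algorithm; the function and parameter names (`kernel_newton_sgd`, `rbf_kernel`, `lam`, `gamma`) are our own assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) Gram matrix between the rows of A and B.
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def kernel_newton_sgd(X, y, batch=64, lam=1e-3, epochs=20, gamma=1.0, seed=0):
    # Mini-batch second-order updates for f(x) = K(x, X) @ alpha.
    # For squared loss, the Hessian block on a mini-batch B is the
    # Gram matrix K_BB, which is fixed by the data alone, so it is
    # never recomputed across steps and its size is set by the batch.
    n = X.shape[0]
    alpha = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for _ in range(max(1, n // batch)):
            B = rng.choice(n, size=min(batch, n), replace=False)
            K_Bn = rbf_kernel(X[B], X, gamma)   # b x n rows of the Gram matrix
            resid = y[B] - K_Bn @ alpha         # residual on the mini-batch
            K_BB = K_Bn[:, B]                   # b x b Hessian block, data-only
            # Newton step on the batch coordinates, ridged by lam.
            alpha[B] += np.linalg.solve(K_BB + lam * np.eye(len(B)), resid)
    return alpha

# Toy usage: fit a noisy sine curve and report training error.
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(512, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(512)
alpha = kernel_newton_sgd(X, y, batch=64, gamma=0.5)
pred = rbf_kernel(X, X, 0.5) @ alpha
print("train MSE:", float(np.mean((pred - y) ** 2)))
```

In this sketch the per-step cost is O(bnd) for the Gram rows plus O(b^3) for the solve, and only a b-by-b matrix is ever factored, matching the abstract's point that memory can be controlled via the mini-batch size.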

Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022, 5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. kernel machines
  2. neural networks
  3. second-order optimization

Qualifiers

  • Research-article

Conference

CIKM '22

Acceptance Rates

CIKM '22 Paper Acceptance Rate: 621 of 2,257 submissions, 28%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%

