DOI: 10.1145/3511808.3557307

Efficient Second-Order Optimization for Neural Networks with Kernel Machines

Published: 17 October 2022

Abstract

Second-order optimization has recently been explored for neural network training. However, recomputing the Hessian matrix during second-order optimization imposes a heavy extra computation and memory burden on training. Some attempts address this issue by approximating the Hessian matrix, which unfortunately degrades the performance of the neural models. To tackle this issue, we propose Kernel Stochastic Gradient Descent (Kernel SGD), which solves the optimization problem in a space transformed by the Hessian matrix of the kernel machine. Kernel SGD eliminates Hessian recomputation during training and requires a much smaller memory cost, which can be controlled via the mini-batch size. We show that Kernel SGD is theoretically guaranteed to converge. Our experimental results on tabular, image, and text data confirm that Kernel SGD converges up to 30 times faster than existing second-order optimization techniques and achieves the highest test accuracy on all tasks tested. Kernel SGD even outperforms first-order optimization baselines on some of the problems in our experiments.
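To make the idea concrete, the sketch below illustrates second-order mini-batch updates for a kernel machine with squared loss. For this loss, the Hessian block associated with a mini-batch is just that batch's Gram matrix, which depends only on the data and not on the coefficients, so it never has to be recomputed as training progresses, and its size is controlled by the mini-batch size. This is a minimal illustrative sketch, not the paper's actual algorithm; the function and parameter names (`kernel_newton_sgd`, `rbf_kernel`, `lam`, `gamma`) are our own assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian (RBF) Gram matrix between the rows of A and B.
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def kernel_newton_sgd(X, y, batch=64, lam=1e-3, epochs=20, gamma=1.0, seed=0):
    # Mini-batch second-order updates for f(x) = K(x, X) @ alpha.
    # For squared loss, the Hessian block on a mini-batch B is the
    # Gram matrix K_BB, which is fixed by the data alone, so it is
    # never recomputed across steps and its size is set by the batch.
    n = X.shape[0]
    alpha = np.zeros(n)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for _ in range(max(1, n // batch)):
            B = rng.choice(n, size=min(batch, n), replace=False)
            K_Bn = rbf_kernel(X[B], X, gamma)   # b x n rows of the Gram matrix
            resid = y[B] - K_Bn @ alpha         # residual on the mini-batch
            K_BB = K_Bn[:, B]                   # b x b Hessian block, data-only
            # Newton step on the batch coordinates, ridged by lam.
            alpha[B] += np.linalg.solve(K_BB + lam * np.eye(len(B)), resid)
    return alpha

# Toy usage: fit a noisy sine curve and report training error.
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(512, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(512)
alpha = kernel_newton_sgd(X, y, batch=64, gamma=0.5)
pred = rbf_kernel(X, X, 0.5) @ alpha
print("train MSE:", float(np.mean((pred - y) ** 2)))
```

In this sketch the per-step cost is O(bnd) for the Gram rows plus O(b^3) for the solve, and only a b-by-b matrix is ever factored, matching the abstract's point that memory can be controlled via the mini-batch size.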

Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022, 5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. kernel machines
  2. neural networks
  3. second-order optimization

Qualifiers

  • Research-article

Conference

CIKM '22

Acceptance Rates

CIKM '22 Paper Acceptance Rate: 621 of 2,257 submissions, 28%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%

