
On the Sample Complexity of Random Fourier Features for Online Learning: How Many Random Fourier Features Do We Need?

Published: 01 June 2014

Abstract

We study the sample complexity of random Fourier features for online kernel learning, that is, the number of random Fourier features required to achieve good generalization performance. We show that when the loss function is strongly convex and smooth, online kernel learning with random Fourier features can achieve an O(log T / T) bound for the excess risk with only O(1/λ²) random Fourier features, where T is the number of training examples and λ is the modulus of strong convexity. This is a significant improvement over the existing result for batch kernel learning, which requires O(T) random Fourier features to achieve a generalization bound of O(1/√T). Our empirical study verifies that online kernel learning with a limited number of random Fourier features can achieve generalization performance similar to that of online learning using the full kernel matrix. We also present an enhanced online learning algorithm with random Fourier features that improves classification performance through multiple passes over the training examples and partial averaging.
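To make the setting concrete, below is a minimal sketch (not the authors' exact algorithm) of online learning with random Fourier features: the features approximate a Gaussian kernel in the style of Rahimi and Recht, and a single pass of online gradient descent is run on a λ-strongly-convex regularized logistic loss with step size 1/(λt). The feature count m, bandwidth sigma, regularization lam, and the toy data are illustrative choices, not values from the paper.

```python
# Minimal sketch: online learning with random Fourier features (illustrative only).
import numpy as np

def make_rff(d, m, sigma, rng):
    """Sample a random Fourier feature map z(x) approximating a Gaussian kernel."""
    W = rng.normal(scale=1.0 / sigma, size=(m, d))   # frequencies w_i ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2 * np.pi, size=m)          # phases b_i ~ Uniform[0, 2*pi]
    return lambda x: np.sqrt(2.0 / m) * np.cos(W @ x + b)

def online_rff_logistic(X, y, m=200, sigma=1.0, lam=0.01, seed=0):
    """One pass of online gradient descent on a lambda-strongly-convex logistic loss."""
    rng = np.random.default_rng(seed)
    z = make_rff(X.shape[1], m, sigma, rng)
    w = np.zeros(m)
    w_sum = np.zeros(m)                               # running sum of iterates for averaging
    for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
        phi = z(x_t)
        margin = y_t * (w @ phi)
        grad = -y_t * phi / (1.0 + np.exp(margin)) + lam * w   # grad of log-loss + (lam/2)||w||^2
        w -= grad / (lam * t)                         # step size 1/(lambda * t)
        w_sum += w
    return w_sum / len(y), z                          # averaged predictor and feature map

# Toy usage with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=500) > 0, 1.0, -1.0)
w_avg, z = online_rff_logistic(X, y)
pred = np.sign(np.array([z(x) for x in X]) @ w_avg)
print("training accuracy:", (pred == y).mean())
```

The plain running average of the iterates used here is only a simple stand-in for the partial averaging mentioned in the abstract; the key point of the sketch is that the number of features m is fixed in advance, independently of the number of training examples T.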




      • Published in

ACM Transactions on Knowledge Discovery from Data, Volume 8, Issue 3
June 2014, 160 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/2630992

        Copyright © 2014 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 June 2014
        • Accepted: 1 June 2009
        • Revised: 1 March 2009
        • Received: 1 February 2007
Published in TKDD, Volume 8, Issue 3


        Qualifiers

        • research-article
        • Research
        • Refereed
