
On the Sample Complexity of Random Fourier Features for Online Learning: How Many Random Fourier Features Do We Need?

Published: 01 June 2014

Abstract

We study the sample complexity of random Fourier features for online kernel learning, that is, the number of random Fourier features required to achieve good generalization performance. We show that when the loss function is strongly convex and smooth, online kernel learning with random Fourier features can achieve an O(log T/T) bound for the excess risk with only O(1/λ²) random Fourier features, where T is the number of training examples and λ is the modulus of strong convexity. This is a significant improvement over the existing result for batch kernel learning, which requires O(T) random Fourier features to achieve a generalization bound of O(1/√T). Our empirical study verifies that online kernel learning with a limited number of random Fourier features can achieve generalization performance similar to that of online learning using the full kernel matrix. We also present an enhanced online learning algorithm with random Fourier features that improves classification performance by making multiple passes over the training examples and using a partial average of the intermediate solutions.
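To make the approach concrete, here is a minimal sketch, not the authors' implementation, of online kernel learning with random Fourier features. It assumes a Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2) approximated by the standard random feature map z(x) = sqrt(2/m) * cos(Wx + b) of Rahimi and Recht, together with an l2-regularized logistic loss, which is strongly convex and smooth as the analysis requires. The step size 1/(λt), the multiple passes, and the averaging of the later iterates (one reading of the "partial average" above) are illustrative choices.

```python
# Minimal sketch: online kernel learning with random Fourier features.
# Assumptions (illustrative, not the paper's exact algorithm): Gaussian
# kernel k(x, y) = exp(-gamma * ||x - y||^2), l2-regularized logistic
# loss, step size 1/(lam * t), and a suffix average of the iterates as
# a stand-in for the paper's "partial average".
import numpy as np

def sample_fourier_features(d, m, gamma, rng):
    # For k(x, y) = exp(-gamma * ||x - y||^2) the spectral density is
    # N(0, 2 * gamma * I), so with z(x) = sqrt(2/m) * cos(W x + b) we
    # get z(x)^T z(y) ~ k(x, y) in expectation.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(m, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return W, b

def features(x, W, b):
    return np.sqrt(2.0 / W.shape[0]) * np.cos(W @ x + b)

def online_rff(X, y, m=200, gamma=1.0, lam=1e-2, n_passes=1, seed=0):
    # Online gradient descent in the m-dimensional random feature space;
    # each update costs O(m), versus kernel methods whose per-step cost
    # grows with the number of support vectors.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W, b = sample_fourier_features(d, m, gamma, rng)
    w = np.zeros(m)
    w_avg, k = np.zeros(m), 0
    T = n * n_passes
    for t in range(1, T + 1):
        i = (t - 1) % n
        z = features(X[i], W, b)
        margin = y[i] * (w @ z)
        # Gradient of log(1 + exp(-margin)) + (lam / 2) * ||w||^2.
        g = -y[i] * z / (1.0 + np.exp(margin)) + lam * w
        w -= g / (lam * t)      # step size 1/(lam * t)
        if t > T // 2:          # partial (suffix) average of iterates
            k += 1
            w_avg += (w - w_avg) / k
    return w_avg, W, b

# Usage: predict with sign(w_avg @ features(x, W, b)), labels y in {-1, +1}.
```

Note that the number of random features m is fixed up front, on the order of 1/λ² according to the bound, and does not grow with T, which is what makes the per-step cost constant over the course of learning.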

Supplementary Material

a13-lin-apndx.pdf (lin.zip)
Supplemental appendix and accompanying files for "On the Sample Complexity of Random Fourier Features for Online Learning: How Many Random Fourier Features Do We Need?"


      Published In

      ACM Transactions on Knowledge Discovery from Data  Volume 8, Issue 3
      June 2014
      160 pages
      ISSN:1556-4681
      EISSN:1556-472X
      DOI:10.1145/2630992

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 June 2014
      Accepted: 01 June 2009
      Revised: 01 March 2009
      Received: 01 February 2007
      Published in TKDD Volume 8, Issue 3


      Author Tags

      1. Nyström
      2. kernel learning
      3. sampling complexity

      Qualifiers

      • Research-article
      • Research
      • Refereed

