
On the Sample Complexity of Random Fourier Features for Online Learning: How Many Random Fourier Features Do We Need?

Published: 01 June 2014

Abstract

We study the sample complexity of random Fourier features for online kernel learning, that is, the number of random Fourier features required to achieve good generalization performance. We show that when the loss function is strongly convex and smooth, online kernel learning with random Fourier features can achieve an O(log T / T) bound for the excess risk with only O(1/λ²) random Fourier features, where T is the number of training examples and λ is the modulus of strong convexity. This is a significant improvement over the existing result for batch kernel learning, which requires O(T) random Fourier features to achieve a generalization bound of O(1/√T). Our empirical study verifies that online kernel learning with a limited number of random Fourier features can achieve generalization performance similar to that of online learning using the full kernel matrix. We also present an enhanced online learning algorithm with random Fourier features that improves classification performance through multiple passes over the training examples and partial averaging.
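To make the setting concrete, below is a minimal sketch (not the authors' exact algorithm) of online learning with random Fourier features: the features approximate a Gaussian kernel in the style of Rahimi and Recht, and a single pass of online gradient descent is run on a λ-strongly-convex regularized logistic loss with step size 1/(λt). The feature count m, bandwidth sigma, regularization lam, and the toy data are illustrative choices, not values from the paper.

```python
# Minimal sketch: online learning with random Fourier features (illustrative only).
import numpy as np

def make_rff(d, m, sigma, rng):
    """Sample a random Fourier feature map z(x) approximating a Gaussian kernel."""
    W = rng.normal(scale=1.0 / sigma, size=(m, d))   # frequencies w_i ~ N(0, sigma^-2 I)
    b = rng.uniform(0.0, 2 * np.pi, size=m)          # phases b_i ~ Uniform[0, 2*pi]
    return lambda x: np.sqrt(2.0 / m) * np.cos(W @ x + b)

def online_rff_logistic(X, y, m=200, sigma=1.0, lam=0.01, seed=0):
    """One pass of online gradient descent on a lambda-strongly-convex logistic loss."""
    rng = np.random.default_rng(seed)
    z = make_rff(X.shape[1], m, sigma, rng)
    w = np.zeros(m)
    w_sum = np.zeros(m)                               # running sum of iterates for averaging
    for t, (x_t, y_t) in enumerate(zip(X, y), start=1):
        phi = z(x_t)
        margin = y_t * (w @ phi)
        grad = -y_t * phi / (1.0 + np.exp(margin)) + lam * w   # grad of log-loss + (lam/2)||w||^2
        w -= grad / (lam * t)                         # step size 1/(lambda * t)
        w_sum += w
    return w_sum / len(y), z                          # averaged predictor and feature map

# Toy usage with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=500) > 0, 1.0, -1.0)
w_avg, z = online_rff_logistic(X, y)
pred = np.sign(np.array([z(x) for x in X]) @ w_avg)
print("training accuracy:", (pred == y).mean())
```

The plain running average of the iterates used here is only a simple stand-in for the partial averaging mentioned in the abstract; the key point of the sketch is that the number of features m is fixed in advance, independently of the number of training examples T.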




      • Published in

ACM Transactions on Knowledge Discovery from Data, Volume 8, Issue 3
June 2014, 160 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/2630992

        Copyright © 2014 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 June 2014
        • Accepted: 1 June 2009
        • Revised: 1 March 2009
        • Received: 1 February 2007
Published in TKDD, Volume 8, Issue 3


        Qualifiers

        • research-article
        • Research
        • Refereed
