Abstract
We study the sample complexity of random Fourier features for online kernel learning, that is, the number of random Fourier features required to achieve good generalization performance. We show that when the loss function is strongly convex and smooth, online kernel learning with random Fourier features can achieve an O(log T / T) bound on the excess risk with only O(1/λ²) random Fourier features, where T is the number of training examples and λ is the modulus of strong convexity. This is a significant improvement over the existing result for batch kernel learning, which requires O(T) random Fourier features to achieve a generalization bound of O(1/√T). Our empirical study verifies that online kernel learning with a limited number of random Fourier features can achieve generalization performance similar to that of online learning using the full kernel matrix. We also present an enhanced online learning algorithm with random Fourier features that improves classification performance through multiple passes over the training examples and partial averaging.
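To make the setup concrete, below is a minimal sketch of online kernel learning with random Fourier features, assuming a Gaussian kernel approximated via the random feature construction of Rahimi and Recht, an L2-regularized logistic loss (strongly convex with modulus λ), and the O(1/(λt)) step size standard for strongly convex online learning. The function names and parameters (make_rff, online_rff_learner, gamma) are illustrative, not taken from the paper.

import numpy as np

def make_rff(d, m, gamma, rng):
    # Sample a random Fourier feature map z(x) for the Gaussian kernel
    # k(x, y) = exp(-gamma * ||x - y||^2): frequencies W drawn from the
    # kernel's spectral density and uniform random phases b, so that
    # E[z(x) . z(y)] approximates k(x, y).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(m, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return lambda x: np.sqrt(2.0 / m) * np.cos(W @ x + b)

def online_rff_learner(stream, d, m, lam, gamma, rng):
    # Online gradient descent in the m-dimensional random feature space
    # on the lam-strongly-convex objective
    #   log(1 + exp(-y * w.z(x))) + (lam/2) * ||w||^2,
    # with step size 1/(lam * t) as in the strongly convex setting.
    z = make_rff(d, m, gamma, rng)
    w = np.zeros(m)
    for t, (x, y) in enumerate(stream, start=1):  # labels y in {-1, +1}
        phi = z(x)
        grad = -y * phi / (1.0 + np.exp(y * (w @ phi))) + lam * w
        w -= grad / (lam * t)
    return w, z

As a hypothetical usage, with m on the order of 1/λ² random features one might call online_rff_learner(zip(X, y), d=X.shape[1], m=200, lam=0.05, gamma=1.0, rng=np.random.default_rng(0)). The multi-pass, partial-averaging variant mentioned above would replay the stream several times and average the iterates from the later passes.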
Supplemental Material
Supplemental movie, appendix, image, and software files for "On the Sample Complexity of Random Fourier Features for Online Learning: How Many Random Fourier Features Do We Need?" are available for download.