Abstract
Online kernel selection is critical to online kernel learning. However, the per-round time complexity of existing online kernel selection algorithms grows linearly with the number of examples observed so far, which is inefficient for online learning. To address this issue, we propose a novel stochastic online kernel selection algorithm that combines random feature mapping with the instantaneous loss. The algorithm has constant per-round time complexity and comes with a theoretical guarantee. Specifically, at each round the algorithm first maps the arriving example into the random feature space, and then updates the kernel parameter and the weights of the classifier simultaneously via stochastic gradient descent (SGD) on the instantaneous loss. We also prove that the algorithm enjoys a sub-linear regret bound. Experimental results on benchmark datasets demonstrate that the proposed algorithm is both effective and efficient.
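To make the update rule concrete, the following is a minimal sketch of the algorithm, assuming a Gaussian kernel with the random Fourier feature map \(\phi (\varvec{x},\gamma )=\sqrt{1/D}\,[\cos (\gamma \varvec{\Omega }\varvec{x}),\sin (\gamma \varvec{\Omega }\varvec{x})]\) and the hinge loss as the instantaneous loss; the names eta_w and eta_gamma, the fixed learning rates, and the exact feature construction are our assumptions, not taken from the paper.

import numpy as np

class StochasticOnlineKernelSelection:
    """Minimal sketch of online kernel selection in a random feature
    space. Assumptions not taken from the paper: Gaussian kernel,
    random Fourier features, hinge loss, fixed learning rates."""

    def __init__(self, dim, num_features=100, gamma0=1.0,
                 eta_w=0.1, eta_gamma=0.01, seed=None):
        rng = np.random.default_rng(seed)
        # Base frequencies; the kernel parameter gamma rescales them
        # inside the feature map, so gamma can be learned online.
        self.Omega = rng.standard_normal((num_features, dim))
        self.w = np.zeros(2 * num_features)  # classifier weights
        self.gamma = gamma0                  # kernel (bandwidth) parameter
        self.eta_w, self.eta_gamma = eta_w, eta_gamma
        self.D = num_features

    def _features(self, x):
        M = self.Omega @ x                   # M[j] ~ N(0, ||x||^2)
        z = self.gamma * M
        phi = np.concatenate([np.cos(z), np.sin(z)]) / np.sqrt(self.D)
        return phi, M

    def predict(self, x):
        phi, _ = self._features(x)
        return np.sign(self.w @ phi)

    def update(self, x, y):
        """Suffer the instantaneous hinge loss on (x, y), then take one
        SGD step on the weights and the kernel parameter simultaneously."""
        phi, M = self._features(x)
        if y * (self.w @ phi) < 1.0:         # hinge loss is active
            z = self.gamma * M
            # d(phi)/d(gamma), elementwise through cos/sin (chain rule).
            dphi = np.concatenate([-np.sin(z) * M,
                                   np.cos(z) * M]) / np.sqrt(self.D)
            grad_w = -y * phi                # gradient of hinge loss w.r.t. w
            grad_gamma = -y * (self.w @ dphi)  # gradient w.r.t. gamma
            self.w -= self.eta_w * grad_w
            self.gamma -= self.eta_gamma * grad_gamma

Each call to update costs \(O(Dd)\) time for \(D\) random features in dimension \(d\), independent of the round index \(t\), which matches the constant per-round complexity claimed above.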
References
Yang, T., Mahdavi, M., Jin, R., Yi, J., Hoi, S.C.: Online kernel selection: algorithms and evaluations. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, pp. 1197–1203. AAAI Press (2012)
Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing multiple parameters for support vector machines. Mach. Learn. 46(1), 131–159 (2002)
Cristianini, N., Elisseeff, A., Shawe-Taylor, J., Kandola, J.: On kernel-target alignment. In: Advances in Neural Information Processing Systems (2001)
Chen, B., Liang, J., Zheng, N., Príncipe, J.C.: Kernel least mean square with adaptive kernel size. Neurocomputing 191, 95–106 (2016)
Fan, H., Song, Q., Shrestha, S.B.: Kernel online learning with adaptive kernel width. Neurocomputing 175, 233–242 (2016)
Yang, T., Li, Y.F., Mahdavi, M., Jin, R., Zhou, Z.H.: Nyström method vs random Fourier features: a theoretical and empirical comparison. In: Advances in Neural Information Processing Systems, pp. 476–484 (2012)
Dekel, O., Shalev-Shwartz, S., Singer, Y.: The forgetron: a kernel-based perceptron on a budget. SIAM J. Comput. 37(5), 1342–1372 (2008)
Hu, J., Yang, H., King, I., Lyu, M.R., So, A.M.C.: Kernelized online imbalanced learning with fixed budgets. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 2666–2672 (2015)
Lin, M., Weng, S., Zhang, C.: On the sample complexity of random Fourier features for online learning: how many random Fourier features do we need? ACM Trans. Knowl. Discov. Data 8(3), 13 (2014)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems, pp. 1177–1184 (2007)
Rahimi, A., Recht, B.: Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In: Advances in Neural Information Processing Systems, pp. 1313–1320 (2009)
Forster, J., Warmuth, M.K.: Relative expected instantaneous loss bounds. J. Comput. Syst. Sci. 64(1), 76–102 (2002)
Lu, J., Hoi, S.C., Wang, J., Zhao, P., Liu, Z.Y.: Large scale online kernel learning. J. Mach. Learn. Res. 17(47), 1–43 (2016)
Acknowledgments
The work was supported in part by the National Natural Science Foundation of China under grant No. 61673293.
Appendix
In this appendix, we provide the detailed proof of Theorem 1.
Proof
Let \(f_{*}(\varvec{x})=(\varvec{w}^{*})^\top \phi (\varvec{x},\gamma ^{*})\) be the optimal classifier in the random feature space, i.e., the one that minimizes the expected loss. The desired inequality can then be rewritten as a sum of two regret terms: one in the kernel parameter \(\gamma \) and one in the weight vector \(\varvec{w}\).
First, consider \(\ell _t\) as a function of \(\gamma \). From the convexity of the loss function, we obtain
$$\ell _{t}(\gamma _{t})-\ell _{t}(\gamma ^{*})\le \nabla \ell _{t}(\gamma _{t})\,(\gamma _{t}-\gamma ^{*}).$$
Summing the above over \(t=1,\ldots ,T\) and applying the standard online gradient descent analysis with step size \(\eta \) leads to
$$\sum _{t=1}^{T}\big (\ell _{t}(\gamma _{t})-\ell _{t}(\gamma ^{*})\big )\le \frac{(\gamma _{1}-\gamma ^{*})^{2}}{2\eta }+\frac{\eta T L_{1}}{2},$$
where \(L_1=\max _{t\in [T]}\Vert \nabla \ell _{t}(\gamma _{t})\Vert ^{2}\). We adopt a similar procedure for the weight vector \(\varvec{w}\); it then suffices to bound the gradient norms \(\Vert \nabla \ell _{t}(\gamma _{t})\Vert \) and \(\Vert \nabla \ell _{t}(\varvec{w}_{t})\Vert \).
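For completeness, the summation step is the standard telescoping argument for online gradient descent; the following is a minimal sketch, assuming the update \(\gamma _{t+1}=\gamma _{t}-\eta \nabla \ell _{t}(\gamma _{t})\) with a fixed step size \(\eta \) (the paper may instead use a time-varying step size). Expanding the square,
$$(\gamma _{t+1}-\gamma ^{*})^{2}=(\gamma _{t}-\gamma ^{*})^{2}-2\eta \nabla \ell _{t}(\gamma _{t})(\gamma _{t}-\gamma ^{*})+\eta ^{2}\Vert \nabla \ell _{t}(\gamma _{t})\Vert ^{2},$$
so that, by the convexity inequality above,
$$\ell _{t}(\gamma _{t})-\ell _{t}(\gamma ^{*})\le \frac{(\gamma _{t}-\gamma ^{*})^{2}-(\gamma _{t+1}-\gamma ^{*})^{2}}{2\eta }+\frac{\eta }{2}\Vert \nabla \ell _{t}(\gamma _{t})\Vert ^{2}.$$
Summing over \(t\) telescopes the first term and bounds the second by \(\eta T L_{1}/2\); choosing \(\eta =\Theta (1/\sqrt{T})\) yields the sub-linear \(O(\sqrt{T})\) dependence.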
From the result of [13], this bound holds with high probability.
Recalling the definition of \(\varvec{M}\), we can derive \(\varvec{M}[j]\sim \mathcal {N}(0, \Vert \varvec{x}\Vert ^{2})\). This directly yields an upper bound on \(|\nabla \ell (\gamma )|\).
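This distribution can be verified in one line under the standard random Fourier feature construction [10], assuming (our notation) \(\varvec{M}=\varvec{\Omega }\varvec{x}\) with independent rows \(\varvec{\omega }_{j}\sim \mathcal {N}(\varvec{0},\varvec{I}_{d})\):
$$\varvec{M}[j]=\varvec{\omega }_{j}^{\top }\varvec{x}=\sum _{i=1}^{d}\omega _{ji}x_{i}\sim \mathcal {N}\Big (0,\;\sum _{i=1}^{d}x_{i}^{2}\Big )=\mathcal {N}(0,\Vert \varvec{x}\Vert ^{2}),$$
since a linear combination of independent standard Gaussians is Gaussian with variance equal to the sum of the squared coefficients.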
By the tail property of Gaussian variables and the Chernoff inequality, we obtain, with high probability, an upper bound on \(|\nabla \ell (\gamma )|^{2}\) in terms of \(C=\max _{t\in [T]}\Vert \varvec{w}_t\Vert ^2\Vert \varvec{x}_t\Vert ^{2}\). From the basic relationship between the sine and the cosine, it follows that \(\Vert \nabla \ell (\varvec{w})\Vert _2^2=1\).
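The last identity is immediate under the hinge loss and the normalized feature map \(\phi (\varvec{x},\gamma )=\sqrt{1/D}\,\big [\cos (\gamma \varvec{M}),\sin (\gamma \varvec{M})\big ]\) (both are our assumptions; the paper's exact construction may differ). When the loss is active, \(\nabla \ell (\varvec{w})=-y\,\phi (\varvec{x},\gamma )\), so
$$\Vert \nabla \ell (\varvec{w})\Vert _{2}^{2}=\Vert \phi (\varvec{x},\gamma )\Vert _{2}^{2}=\frac{1}{D}\sum _{j=1}^{D}\Big (\cos ^{2}\big (\gamma \varvec{M}[j]\big )+\sin ^{2}\big (\gamma \varvec{M}[j]\big )\Big )=1.$$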
We now conclude our proof. \(\square \)
Copyright information
© 2017 Springer International Publishing AG
Cite this paper
Han, Z., Liao, S. (2017). Stochastic Online Kernel Selection with Instantaneous Loss in Random Feature Space. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, ES. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science(), vol 10634. Springer, Cham. https://doi.org/10.1007/978-3-319-70087-8_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70086-1
Online ISBN: 978-3-319-70087-8