Online Kernel Selection with Multiple Bandit Feedbacks in Random Feature Space

Li, Junfan; Liao, Shizhong

doi:10.1007/978-3-319-99247-1_27


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11062)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management


Abstract

Online kernel selection is critical to online kernel learning, and must address the exploration-exploitation dilemma, where we explore new kernels to find the best one and exploit the kernel that showed the best performance in the past. In this paper, we propose a novel multi-armed bandit solution to the exploration-exploitation dilemma in online kernel selection. We first map each candidate kernel to an arm of a multi-armed bandit problem. Unlike typical multi-armed bandit models, where only one kernel is selected at each round, we sample multiple kernels with replacement according to a probability distribution. Then, we make predictions with the hypotheses learned in the random feature spaces specified by the selected kernels, and incur multiple losses, which we refer to as multiple bandit feedbacks. Finally, we use all the feedbacks to update the probability distribution. We prove that the proposed approach enjoys a sub-linear expected regret bound. Experimental results on benchmark datasets show that the proposed approach performs comparably to existing online kernel selection methods.
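As a rough illustration of the sampling step described above, the sketch below draws several arms with replacement from a probability vector and builds a random feature map for each candidate kernel. The Gaussian kernel family, the candidate widths, the feature dimension, and the helper names `make_rff` and `sample_kernel_set` are assumptions made for this example; they are not details given in the abstract.

```python
import numpy as np

def make_rff(dim, n_features, sigma, rng):
    """Random Fourier feature map for a Gaussian kernel of width sigma
    (Rahimi-Recht construction; the kernel choice is an assumption)."""
    W = rng.normal(scale=1.0 / sigma, size=(n_features, dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def z(x):
        return np.sqrt(2.0 / n_features) * np.cos(W @ x + b)

    return z

def sample_kernel_set(p, m, rng):
    """Draw m arm indices with replacement from p and return the distinct
    indices, i.e. a selected set S_t with |S_t| <= m."""
    draws = rng.choice(len(p), size=m, replace=True, p=p)
    return np.unique(draws)

rng = np.random.default_rng(0)
sigmas = [0.5, 1.0, 2.0, 4.0]          # hypothetical candidate kernel widths
K = len(sigmas)
p = np.full(K, 1.0 / K)                # uniform initial distribution over arms
maps = [make_rff(dim=3, n_features=20000, sigma=s, rng=rng) for s in sigmas]

S_t = sample_kernel_set(p, m=3, rng=rng)

# Inner products of feature vectors approximate the kernel values.
x, y = np.array([0.1, -0.2, 0.3]), np.array([0.0, 0.4, -0.1])
approx = [float(maps[i](x) @ maps[i](y)) for i in range(K)]
exact = [float(np.exp(-np.sum((x - y) ** 2) / (2 * s ** 2))) for s in sigmas]
```

Because sampling is with replacement, the number of distinct selected kernels can be smaller than `m`, which is why the selected set is returned as a set of unique indices.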


References

1. Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77 (2002)
2. Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Found. Trends® Mach. Learn. 5(1), 1–122 (2012)
3. Chang, C., Lin, C.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Tech. 2(3), 1–27 (2011)
4. Chen, B., Liang, J., Zheng, N., Príncipe, J.C.: Kernel least mean square with adaptive kernel size. Neurocomputing 191, 95–106 (2016)
5. Dekel, O., Shalev-Shwartz, S., Singer, Y.: The Forgetron: a kernel-based perceptron on a fixed budget. In: Proceedings of the 19th Annual Conference on Neural Information Processing Systems (NIPS), pp. 259–266 (2005)
6. Fan, H., Song, Q., Shrestha, S.B.: Kernel online learning with adaptive kernel width. Neurocomputing 175, 233–242 (2016)
7. Foster, D.J., Kale, S., Mohri, M., Sridharan, K.: Parameter-free online learning via model selection. In: Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), pp. 6022–6032 (2017)
8. Han, Z., Liao, S.: Stochastic online kernel selection with instantaneous loss in random feature space. In: Liu, D., Xie, S., Li, Y., El-Alfy, E.S. (eds.) ICONIP 2017, vol. 10634, pp. 33–42. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70087-8_4
9. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
10. Lu, J., Hoi, S.C.H., Wang, J., Zhao, P., Liu, Z.: Large scale online kernel learning. J. Mach. Learn. Res. 17, 1–43 (2016)
11. Nguyen, T.D., Le, T., Bui, H., Phung, D.: Large-scale online kernel learning with random feature reparameterization. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pp. 2543–2549 (2017)
12. Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS), pp. 1177–1184 (2007)
13. Shalev-Shwartz, S.: Online learning and online convex optimization. Found. Trends® Mach. Learn. 4(2), 107–194 (2012)
14. Tossou, A.C.Y., Dimitrakakis, C.: Achieving privacy in the adversarial multi-armed bandit. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI), pp. 2653–2659 (2017)
15. Yang, T., Mahdavi, M., Jin, R., Yi, J., Hoi, S.C.H.: Online kernel selection: algorithms and evaluations. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), pp. 1197–1202 (2012)


Acknowledgments

The work was supported in part by the National Natural Science Foundation of China under grant No. 61673293.

Author information

Authors and Affiliations

School of Computer Science and Technology, Tianjin University, Tianjin, 300350, China
Junfan Li & Shizhong Liao


Corresponding author

Correspondence to Shizhong Liao.

Editor information

Editors and Affiliations

University of Bristol, Bristol, United Kingdom
Weiru Liu
Università di Trento, Povo, Italy
Fausto Giunchiglia
Jilin University, Changchun, China
Bo Yang

Appendix: Proof Sketch of Theorem 1

Proof

Let $\ell _{t,i} \in [0,B]$ with $B \ge 1$. We first state two facts:

$$ \sum ^K_{i=1}p_{t,i}\widehat{\ell }_{t,i} = \sum _{\kappa _i \in S_t} \frac{\beta \ell _{t,i}}{\vert S_t\vert },~\sum ^K_{i=1}p_{t,i}\widehat{\ell }^2_{t,i}\le \beta B\sum ^K_{i=1}\widehat{\ell }_{t,i}. $$
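Both facts are consistent with an importance-weighted loss estimate of the form $\widehat{\ell }_{t,i} = \beta \ell _{t,i}/(p_{t,i}\vert S_t\vert )$ for $\kappa _i \in S_t$ and $\widehat{\ell }_{t,i} = 0$ otherwise. This form is inferred from the first identity rather than quoted from the main text, so it should be read as an assumption. Under it, the following sketch checks both facts numerically; the values of $K$, $B$, $\beta$ and the selected set are arbitrary illustrations.

```python
import numpy as np

def loss_estimates(losses, p, S, beta):
    """Assumed estimator: beta * loss / (p * |S|) on selected arms, 0 elsewhere."""
    est = np.zeros_like(p)
    est[S] = beta * losses[S] / (p[S] * len(S))
    return est

rng = np.random.default_rng(1)
K, B, beta = 6, 2.0, 0.5                   # illustrative values only
p = rng.dirichlet(np.ones(K))              # a probability vector over K arms
losses = rng.uniform(0.0, B, size=K)       # losses in [0, B]
S = np.array([0, 2, 5])                    # a hypothetical selected set S_t

est = loss_estimates(losses, p, S, beta)

# Fact 1: sum_i p_i * est_i equals sum over S of beta * loss / |S|.
fact1_lhs = float(np.sum(p * est))
fact1_rhs = float(np.sum(beta * losses[S] / len(S)))

# Fact 2: sum_i p_i * est_i^2 is at most beta * B * sum_i est_i.
fact2_lhs = float(np.sum(p * est ** 2))
fact2_rhs = float(beta * B * np.sum(est))
```

The second fact holds because each product $p_{t,i}\widehat{\ell }_{t,i}$ is at most $\beta B/\vert S_t\vert \le \beta B$ under the assumed estimator.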

Let $W_t = \sum ^K_{i=1}\omega _{t,i}$. Following the proof of Theorem 3.1 in [1], we obtain

$$ \frac{W_{t+1}}{W_t}\le 1-\frac{\gamma }{K(1-\gamma )}\sum _{\kappa _i \in S_t} \frac{\beta \ell _{t,i}}{\vert S_t\vert } + \frac{(2+\beta B)\gamma ^2}{2K^2(1-\gamma )}\sum ^K_{i=1}\widehat{\ell }_{t,i}, $$

where we use the fact that $e^{-x} \le 1-x + \frac{x^2}{2}$ for all $x \ge 0$. Furthermore, using the fact that $1+x \le e^x$ for all $x \in \mathbb {R}$, taking logarithms and summing over $t$ gives

$$\begin{aligned} \ln \frac{W_{T+1}}{W_1} \le -\frac{\gamma }{K(1-\gamma )}\sum ^T_{t=1}\sum _{\kappa _i \in S_t}\frac{\beta \ell _{t,i}}{\vert S_t\vert } + \frac{(2+\beta B)\gamma ^2}{2K^2(1-\gamma )}\sum ^T_{t=1}\sum ^K_{i=1}\widehat{\ell }_{t,i}. \end{aligned}$$

(11)
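The per-round inequality behind (11) can be checked numerically. The sketch below assumes the usual Exp3 form from [1], namely $p_{t,i} = (1-\gamma )\omega _{t,i}/W_t + \gamma /K$ and $\omega _{t+1,i} = \omega _{t,i}\exp (-\gamma \widehat{\ell }_{t,i}/K)$, together with a loss estimate $\widehat{\ell }_{t,i} = \beta \ell _{t,i}/(p_{t,i}\vert S_t\vert )$ on the selected kernels. These forms are inferred from the proof and are not quoted from the main text; the function name `exp3_step` and all numeric values are hypothetical.

```python
import numpy as np

def exp3_step(omega, losses, S, gamma, beta):
    """One assumed Exp3-style round with multiple feedbacks.
    Returns the new weights, the probabilities used, and the loss estimates."""
    K = len(omega)
    p = (1.0 - gamma) * omega / omega.sum() + gamma / K
    est = np.zeros(K)
    est[S] = beta * losses[S] / (p[S] * len(S))     # assumed estimator
    new_omega = omega * np.exp(-gamma * est / K)
    return new_omega, p, est

rng = np.random.default_rng(2)
K, B, beta, gamma = 5, 1.5, 0.8, 0.1      # illustrative values only
omega = rng.uniform(0.5, 2.0, size=K)
losses = rng.uniform(0.0, B, size=K)
S = np.array([1, 3])                      # a hypothetical selected set S_t

new_omega, p, est = exp3_step(omega, losses, S, gamma, beta)

# Left side W_{t+1} / W_t and the right side of the per-round bound.
ratio = new_omega.sum() / omega.sum()
bound = (1.0
         - gamma / (K * (1.0 - gamma)) * np.sum(beta * losses[S] / len(S))
         + (2.0 + beta * B) * gamma ** 2 / (2.0 * K ** 2 * (1.0 - gamma)) * est.sum())
```

Mixing in the uniform term $\gamma /K$ keeps every probability strictly above $\gamma /K$, which is what keeps the importance weights bounded later in the proof.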

Moreover, for any $\kappa _j\in \mathcal {K}$,

$$\begin{aligned} \ln \frac{W_{T+1}}{W_1} \ge \ln \frac{\omega _{T+1,j}}{W_1}=-\frac{\gamma }{K}\sum ^T_{t=1}\widehat{\ell }_{t,j} - \ln {K}. \end{aligned}$$

(12)

Combining (11) and (12), we obtain

$$\begin{aligned} \begin{aligned} \sum ^T_{t=1}\sum _{\kappa _i \in S_t}\frac{\ell _{t,i}}{\vert S_t\vert } \le (1-\gamma )\sum ^T_{t=1}\frac{\widehat{\ell }_{t,j}}{\beta } + \frac{K\ln (K)}{\beta \gamma }+ \frac{(2+\beta B)\gamma }{2K\beta }\sum ^T_{t=1}\sum ^K_{i=1}\widehat{\ell }_{t,i}. \end{aligned} \end{aligned}$$

(13)
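For readers who want the intermediate step, the algebra can be reconstructed as follows, assuming $0< \gamma < 1$ and $\beta > 0$. Chaining (12) and (11) and multiplying through by $K(1-\gamma )/(\gamma \beta )$ gives the second line below, and (13) then follows from $(1-\gamma )K\ln K \le K\ln K$.

$$\begin{aligned} -\frac{\gamma }{K}\sum ^T_{t=1}\widehat{\ell }_{t,j} - \ln K&\le -\frac{\gamma }{K(1-\gamma )}\sum ^T_{t=1}\sum _{\kappa _i \in S_t}\frac{\beta \ell _{t,i}}{\vert S_t\vert } + \frac{(2+\beta B)\gamma ^2}{2K^2(1-\gamma )}\sum ^T_{t=1}\sum ^K_{i=1}\widehat{\ell }_{t,i}, \\ \sum ^T_{t=1}\sum _{\kappa _i \in S_t}\frac{\ell _{t,i}}{\vert S_t\vert }&\le (1-\gamma )\sum ^T_{t=1}\frac{\widehat{\ell }_{t,j}}{\beta } + \frac{(1-\gamma )K\ln K}{\beta \gamma } + \frac{(2+\beta B)\gamma }{2K\beta }\sum ^T_{t=1}\sum ^K_{i=1}\widehat{\ell }_{t,i}. \end{aligned}$$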

Let $S_t = \{\kappa _{i_1}, \kappa _{i_2}, \ldots , \kappa _{i_{\vert S_t\vert }}\}$, where the indices $i_1, i_2, \ldots , i_{\vert S_t\vert }$ are pairwise distinct. Then, for any $\kappa _j \in \mathcal {K}$, we have $p(\kappa _j \in S_t) = p_{t,j}\cdot \delta _{t,j}$. If $\vert S_t\vert < m$,

$$ \delta _{t,j} = \vert S_t\vert \sum ^K_{i_2 = 1,i_2\ne j}\ldots \sum ^K_{i_{\vert S_t\vert }=1,i_{\vert S_t\vert }\ne j}\prod _{i\in S_t, i\ne j}p_{t,i}\sum _{r\in S_t}p_{t,r}. $$

Otherwise, if $\vert S_t\vert = m$,

$$ \delta _{t,j} = \vert S_t\vert \sum ^K_{i_2 = 1,i_2\ne j}\ldots \sum ^K_{i_{\vert S_t\vert }=1,i_{\vert S_t\vert }\ne j}\prod _{i\in S_t, i\ne j}p_{t,i}. $$

We can bound $\delta _{t,j} \le \vert S_t\vert $. For clarity, we write $\ell _{t,j}$ as $\ell (\mathbf {w}_{t,j})$ and introduce the notation $\mathbb {I}^t_{j} = \mathbbm {1}(\kappa _j \in S_t)$. Let $\mathbf {w}^*_j \in \mathcal {H}_{R,j}$ be the best linear model. According to the standard analysis of online convex optimization, we have

$$ \nabla \ell _{\mathbf {w}_{t,j}}\cdot \left( \mathbf {w}_{t,j} - \mathbf {w}^*_j\right) \mathbb {I}^t_{j} =p_{t,j}\vert S_t\vert \frac{\Vert \mathbf {w}_{t,j} - \mathbf {w}^*_j\Vert ^2 - \Vert \mathbf {w}_{t+1,j} - \mathbf {w}^*_j\Vert ^2}{2\eta } + \frac{\eta \nabla \ell ^2_{\mathbf {w}_{t,j}}}{2p_{t,j}\vert S_t\vert }\mathbb {I}^t_{j}. $$
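This identity comes from expanding the squared distance to $\mathbf {w}^*_j$ after one gradient step. The sketch below assumes the update $\mathbf {w}_{t+1,j} = \mathbf {w}_{t,j} - \eta \nabla \ell _{\mathbf {w}_{t,j}}\mathbb {I}^t_{j}/(p_{t,j}\vert S_t\vert )$ with no projection. That update rule is inferred from the form of the identity and is not quoted from the main text; the helper name `ogd_step` and the numeric values are hypothetical.

```python
import numpy as np

def ogd_step(w, grad, eta, p_j, size_S, selected):
    """Importance-weighted gradient step (assumed form, no projection).
    The model is left unchanged when its kernel was not selected."""
    if not selected:
        return w.copy()
    return w - eta * grad / (p_j * size_S)

rng = np.random.default_rng(3)
d, eta, p_j, size_S = 8, 0.05, 0.3, 2       # illustrative values only
w, w_star, grad = rng.normal(size=(3, d))   # arbitrary model, comparator, gradient

w_next = ogd_step(w, grad, eta, p_j, size_S, selected=True)

# Left and right sides of the identity for a selected kernel (indicator = 1).
lhs = float(grad @ (w - w_star))
rhs = float(
    p_j * size_S
    * (np.sum((w - w_star) ** 2) - np.sum((w_next - w_star) ** 2)) / (2 * eta)
    + eta * (grad @ grad) / (2 * p_j * size_S)
)

w_same = ogd_step(w, grad, eta, p_j, size_S, selected=False)
```

When the kernel is not selected, both sides of the identity are zero, since the indicator vanishes and the model does not move.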

Then, we get

$$ \sum ^T_{t=1}\frac{\ell (\mathbf {w}_{t,j}) - \ell (\mathbf {w}^*_j)}{p_{t,j}\vert S_t\vert }\mathbb {I}^t_{j} \le \sum ^T_{t=1}\frac{\Vert \mathbf {w}_{t,j} - \mathbf {w}^*_j\Vert ^2 - \Vert \mathbf {w}_{t+1,j} - \mathbf {w}^*_j\Vert ^2}{2\eta }+ \sum ^T_{t=1}\frac{\eta \nabla \ell ^2_{\mathbf {w}_{t,j}}}{2p^2_{t,j}\vert S_t\vert ^2}\mathbb {I}^t_{j}. $$

Taking expectation with respect to $S_1, S_2, \ldots , S_T$ gives

$$\begin{aligned} \sum ^T_{t=1}\mathbb {E}\left[ \frac{\ell (\mathbf {w}_{t,j})}{p_{t,j}\vert S_t\vert }\mathbb {I}^t_{j}\right] \le \sum ^T_{t=1}\ell (\mathbf {w}^*_j) + \frac{\Vert \mathbf {w}^*_j\Vert ^2}{2\eta } +\frac{K\eta L^2T}{2\gamma }. \end{aligned}$$

(14)

Here we apply the facts $p_{t,j}>\frac{\gamma }{K}, \delta _{t,j} \le \vert S_t\vert $ and

$$ \sum ^T_{t=1}\mathbb {E}\left[ \frac{\ell (\mathbf {w}^*_j)}{p_{t,j}\vert S_t\vert }\mathbb {I}^t_{j}\right] = \sum ^T_{t=1}\mathbb {E}\left[ p_{t,j}\delta _{t,j}\frac{\ell (\mathbf {w}^*_j)}{p_{t,j}\vert S_t\vert }\right] \le \sum ^T_{t=1}\ell (\mathbf {w}^*_j). $$

We also have

$$\begin{aligned} \mathbb {E}\left[ \ell _{t,I_t}\right] = \frac{1}{\vert S_t\vert }\sum _{\kappa _i \in S_t}\mathbb {E}\left[ \ell _{t,i}\right] . \end{aligned}$$

(15)

Let $\eta = \sqrt{\frac{\Vert \mathbf {w}^*_j\Vert ^2\gamma }{KL^2T}}$. According to (13), (14) and (15), we obtain

$$ \sum ^T_{t=1}\mathbb {E}\left[ \ell _{t,I_t}\right] \le \sum ^T_{t=1}\ell (\mathbf {w}^*_j) + \sqrt{\frac{\Vert \mathbf {w}^*_j \Vert ^2KL^2 T}{\gamma }} + \frac{K\ln K}{\gamma \beta }+ \frac{(2+\beta B)\gamma BT}{2}. $$

Let $\gamma = a^{\frac{1}{3}}_1(2b_1)^{-\frac{2}{3}}T^{-\frac{1}{3}}$, where $a_1 = \Vert \mathbf {w}^*_j \Vert ^2KL^2$ and $ b_1 = \frac{(2+\beta B)B}{2}$. Then, we get

$$\begin{aligned} \sum ^T_{t=1}\mathbb {E}\left[ \ell _{t,I_t}\right] \le \sum ^T_{t=1}\ell (\mathbf {w}^*_j)+2(a_1b_1)^{\frac{1}{3}}T^{\frac{2}{3}} + \frac{a^{-\frac{1}{3}}_1(2b_1)^{\frac{2}{3}}K\ln (K)T^{\frac{1}{3}}}{\beta }. \end{aligned}$$

(16)
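The choice of $\gamma $ balances the two leading $\gamma $-dependent terms, $\sqrt{a_1T/\gamma }$ and $b_1\gamma T$. Setting the derivative of their sum to zero gives the stated $\gamma $, and the sum then equals $3\cdot 2^{-\frac{2}{3}}(a_1b_1)^{\frac{1}{3}}T^{\frac{2}{3}}$, which is below the $2(a_1b_1)^{\frac{1}{3}}T^{\frac{2}{3}}$ appearing in (16). The sketch below checks this with arbitrary illustrative values for $a_1$, $b_1$ and $T$; the function names are hypothetical.

```python
def tuned_gamma(a1, b1, T):
    """gamma = a1^(1/3) * (2 b1)^(-2/3) * T^(-1/3), as chosen in the proof."""
    return a1 ** (1 / 3) * (2 * b1) ** (-2 / 3) * T ** (-1 / 3)

def leading_terms(a1, b1, T, gamma):
    """Sum of the two gamma-dependent leading terms:
    sqrt(a1 * T / gamma) + b1 * gamma * T."""
    return (a1 * T / gamma) ** 0.5 + b1 * gamma * T

a1, b1, T = 4.0, 1.5, 10_000          # illustrative values, not from the paper
g = tuned_gamma(a1, b1, T)
value = leading_terms(a1, b1, T, g)

# Closed form at the tuned gamma, and the slightly looser constant used in (16).
closed_form = 3 * 2 ** (-2 / 3) * (a1 * b1) ** (1 / 3) * T ** (2 / 3)
stated_bound = 2 * (a1 * b1) ** (1 / 3) * T ** (2 / 3)
```

For this choice to be a valid exploration rate, $T$ must be large enough that $\gamma < 1$, which holds for the values used here.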

Next, we bound the difference between $\sum ^T_{t=1}\ell (\mathbf {w}^*_j)$ and $\sum ^T_{t=1}\ell (f^*_j)$, where $f^*_j \in \mathcal {H}_j$. With the analysis of “Fourier Online Gradient Descent” [10], we have

$$\begin{aligned} \sum ^T_{t=1}\ell (\mathbf {w}^*_j) - \sum ^T_{t=1}\ell (f^*_j)\le L\,\epsilon \,T \Vert f^*_j \Vert _1, \end{aligned}$$

(17)

and $\Vert \mathbf {w}^*_j \Vert ^2 \le (1+\epsilon )\Vert f^*_j \Vert ^2_1$ with high probability according to Claim 1 in [12]. Combining (16) and (17) yields

$$ \sum ^T_{t=1}\mathbb {E}\left[ \ell _{t,I_t}\right] \le \sum ^T_{t=1}\ell (f^*_j) + L\,\epsilon \,T\Vert f^*_j \Vert _1 +2(a_2b_1)^{\frac{1}{3}}T^{\frac{2}{3}} + \frac{{a_2}^{-\frac{1}{3}}(2b_1)^{\frac{2}{3}}K\ln (K)T^{\frac{1}{3}}}{\beta }, $$

where $a_2 = (1+\epsilon )\Vert f^*_j \Vert _1^2KL^2$, which completes the proof.


About this paper

Cite this paper

Li, J., Liao, S. (2018). Online Kernel Selection with Multiple Bandit Feedbacks in Random Feature Space. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science, vol 11062. Springer, Cham. https://doi.org/10.1007/978-3-319-99247-1_27


DOI: https://doi.org/10.1007/978-3-319-99247-1_27
Published: 11 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99246-4
Online ISBN: 978-3-319-99247-1
eBook Packages: Computer Science (R0)
