
A Novel Multiple Kernel Learning Method Based on the Kullback–Leibler Divergence


Abstract

In this paper we develop a novel multiple kernel learning (MKL) model based on the idea of multiplicative perturbation of data in a new feature space, within the framework of uncertain convex programs (UCPs). In the proposed model, we use the Kullback–Leibler divergence to measure the difference between the estimated kernel weights and the ideal kernel weights. Instead of handling the proposed model directly in the primal, we obtain the optimistic counterpart of its Lagrange dual in terms of the theory of UCPs and solve it by alternating optimization. As the parameter varies, the proposed model traces a solution path from a robust combined kernel to the combined kernel corresponding to the initial ideal kernel weights. In addition, we give a simple strategy for selecting the initial kernel weights as the ideal kernel weights when no prior knowledge of the kernel weights is available. Experimental results on several data sets show that the proposed model achieves performance competitive with some state-of-the-art MKL algorithms.
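To make the alternating scheme concrete, the following is a minimal sketch rather than the authors' implementation: it assumes scikit-learn's precomputed-kernel SVC as the inner SVM solver, and the function name kl_mkl and all variable names are our own. The kernel-weight update it uses is the closed-form expression derived in the appendix (Eq. (25)).

```python
import numpy as np
from sklearn.svm import SVC

def kl_mkl(kernels, y, q, gamma, C=1.0, n_iter=20):
    """Alternate between (i) a standard SVM dual solve on the combined kernel
    with the weights theta fixed and (ii) the closed-form, KL-based update of
    theta given in Eq. (25) of the appendix."""
    y = np.asarray(y, dtype=float)
    q = np.asarray(q, dtype=float)
    n = y.shape[0]
    theta = q.copy()                        # start from the ideal kernel weights
    alpha = np.zeros(n)
    for _ in range(n_iter):
        K = sum(t * Kk for t, Kk in zip(theta, kernels))
        svm = SVC(C=C, kernel='precomputed').fit(K, y)
        alpha = np.zeros(n)
        alpha[svm.support_] = np.abs(svm.dual_coef_).ravel()  # 0 <= alpha_i <= C
        # KL-based update: theta_k is proportional to q_k * exp(-quad_k / (2*gamma))
        ay = alpha * y
        quad = np.array([ay @ Kk @ ay for Kk in kernels])
        theta = q * np.exp(-quad / (2.0 * gamma))
        theta /= theta.sum()
    return alpha, theta
```

A numerically safer, stand-alone version of the weight update is sketched after the appendix derivation.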



Acknowledgments

This work is partially supported by the National Natural Science Foundation of P.R. China (Grant Nos. 61003169 and 61303182).

Author information

Corresponding author

Correspondence to Zhizheng Liang.

Appendix: The derivation of Eq. (13)

$$\begin{aligned}&\mathop {\max }\limits _{{\varvec{\alpha }},{\varvec{\theta }}} f({\varvec{\alpha }},{\varvec{\theta }}):=-\frac{1}{2}\sum _{k=1}^d {\theta _k } {\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}+{\varvec{\alpha }}^{T}{{\varvec{e}}}- \gamma \sum _{k=1}^d {\theta _k } \log \frac{\theta _k }{q_k }\nonumber \\&\hbox {s.t.} \sum _{i=1}^n {\alpha _i y_i =0} ,C\ge \alpha _i \ge 0,i=1,\ldots ,n,\nonumber \\&\quad \sum _{k=1}^d {\theta _k } =1,\theta _k \ge 0, \end{aligned}$$
(18)

From Eq. (18), for fixed \(\alpha _i (i=1,\ldots ,n)\), \(f({\varvec{\alpha }},{\varvec{\theta }})\) is strictly concave with respect to \(\theta \). Thus maximizing \(f({\varvec{\alpha }},{\varvec{\theta }})\) over the simplex yields a unique solution. We define the following partial Lagrangian function:

$$\begin{aligned} L({\varvec{\theta }}):=-\frac{1}{2}\sum _{k=1}^d {\theta _k } {\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}+{\varvec{\alpha }}^{T}{{\varvec{e}}}- \gamma \sum _{k=1}^d {\theta _k } \log \frac{\theta _k }{q_k }+\lambda \left( 1-\sum _{k=1}^d {\theta _k } \right) \nonumber \\ \end{aligned}$$
(19)

Setting the derivative of \(L({\varvec{\theta }})\) with respect to \(\theta _k \) to zero (the constant \(-\gamma \) produced by differentiating the entropy term is absorbed into the multiplier \(\lambda \)) gives

$$\begin{aligned} \frac{\partial L({\varvec{\theta }})}{\partial \theta _k }= -\frac{1}{2}{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}-\gamma \log \theta _k +\gamma \log q_k -\lambda =0. \end{aligned}$$
(20)
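As a quick symbolic check (a minimal SymPy sketch; the scalar a_k below is our shorthand for the quadratic form \({\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}\) and is not notation from the paper), differentiating the \(\theta _k \)-dependent part of Eq. (19) reproduces Eq. (20) up to the constant \(-\gamma \) that is absorbed into \(\lambda \):

```python
import sympy as sp

theta_k, q_k, a_k, gamma, lam = sp.symbols('theta_k q_k a_k gamma lambda', positive=True)

# theta_k-dependent part of the partial Lagrangian in Eq. (19);
# a_k is shorthand for alpha^T diag(y) K_k diag(y) alpha (a fixed scalar here).
L_k = -sp.Rational(1, 2) * a_k * theta_k \
      - gamma * theta_k * sp.log(theta_k / q_k) \
      + lam * (1 - theta_k)

# Prints an expression equivalent to -a_k/2 - gamma*log(theta_k/q_k) - gamma - lambda,
# i.e. Eq. (20) once the constant -gamma is folded into lambda.
print(sp.diff(L_k, theta_k))
```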

From Eq. (20), one has

$$\begin{aligned} -\frac{1}{2}{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}-\lambda =\gamma \log \frac{\theta _k }{q_k } \end{aligned}$$
(21)

From Eq. (21), one has

$$\begin{aligned} q_k e^{-\frac{1}{2\gamma }{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}-\frac{\lambda }{\gamma }}=\theta _k . \end{aligned}$$
(22)

Since \(\sum _{k=1}^d {\theta _k } =1\), we have

$$\begin{aligned}&\sum _{k=1}^d {q_k } e^{-\frac{1}{2\gamma }{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}-\frac{\lambda }{\gamma }}=\sum _{k=1}^d {\theta _k } =1.\end{aligned}$$
(23)
$$\begin{aligned}&e^{-\frac{\lambda }{\gamma }}\sum _{k=1}^d {q_k } e^{-\frac{1}{2\gamma }{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}}=1. \end{aligned}$$
(24)

Solving Eq. (24) for \(e^{-\lambda /\gamma }\) and substituting it into Eq. (22), one has

$$\begin{aligned} \theta _k =\frac{q_k \, e^{-\frac{\sum _{i,j} \alpha _i \alpha _j y_i y_j K_k (\mathbf{x}_i ,\mathbf{x}_j )}{2\gamma }}}{\sum _{s=1}^d q_s \, e^{-\frac{\sum _{i,j} \alpha _i \alpha _j y_i y_j K_s (\mathbf{x}_i ,\mathbf{x}_j )}{2\gamma }}}, \quad k=1,\ldots ,d. \end{aligned}$$
(25)

From Eq. (25), one can see that the non-negativity of \({\varvec{\theta }}\) is automatically guaranteed, since each \(\theta _k \) is a ratio of positive quantities whenever \(q_k >0\).
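For reference, here is a minimal NumPy sketch of this closed-form update; the function name update_kernel_weights, the argument layout, and the log-space (softmax-style) normalization used for numerical stability are our own choices and are not taken from the paper.

```python
import numpy as np

def update_kernel_weights(alpha, y, kernels, q, gamma):
    """Closed-form theta update of Eq. (25) for fixed dual variables alpha.

    alpha   : (n,) SVM dual variables
    y       : (n,) labels in {-1, +1}
    kernels : list of d base kernel matrices K_k, each (n, n)
    q       : (d,) ideal (prior) kernel weights, q_k > 0
    gamma   : regularization parameter of the KL term
    """
    q = np.asarray(q, dtype=float)
    ay = np.asarray(alpha, dtype=float) * np.asarray(y, dtype=float)
    # quadratic forms alpha^T diag(y) K_k diag(y) alpha, one per base kernel
    quad = np.array([ay @ K @ ay for K in kernels])
    # theta_k is proportional to q_k * exp(-quad_k / (2*gamma)); work in
    # log-space (softmax trick) so a small gamma does not underflow to zero
    logits = np.log(q) - quad / (2.0 * gamma)
    logits -= logits.max()
    theta = np.exp(logits)
    return theta / theta.sum()
```

Because \(\theta _k \propto q_k e^{-{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}/(2\gamma )}\), the update tends to the ideal weights \(q\) as \(\gamma \) grows, which matches one end of the solution path mentioned in the abstract.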

Cite this article

Liang, Z., Zhang, L. & Liu, J. A Novel Multiple Kernel Learning Method Based on the Kullback–Leibler Divergence. Neural Process Lett 42, 745–762 (2015). https://doi.org/10.1007/s11063-014-9392-3
