
A Novel Multiple Kernel Learning Method Based on the Kullback–Leibler Divergence


Abstract

In this paper we develop a novel multiple kernel learning (MKL) model based on the idea of multiplicative perturbation of data in a new feature space, within the framework of uncertain convex programs (UCPs). In the proposed model, we use the Kullback–Leibler divergence to measure the difference between the estimated kernel weights and the ideal kernel weights. Instead of handling the proposed model directly in the primal, we obtain the optimistic counterpart of its Lagrange dual in terms of the theory of UCPs and solve it by alternating optimization. As the parameter varies, the proposed model traces a solution path from a robust combined kernel to the combined kernel corresponding to the initial ideal kernel weights. In addition, we give a simple strategy for selecting the initial kernel weights as the ideal kernel weights when no prior knowledge of the kernel weights is available. Experimental results on several data sets show that the proposed model achieves performance competitive with some state-of-the-art MKL algorithms.
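To make the alternating scheme concrete, the following is a minimal sketch rather than the authors' implementation: it assumes scikit-learn's precomputed-kernel SVC as the inner SVM solver, and the function name kl_mkl and all variable names are our own. The kernel-weight update it uses is the closed-form expression derived in the appendix (Eq. (25)).

```python
import numpy as np
from sklearn.svm import SVC

def kl_mkl(kernels, y, q, gamma, C=1.0, n_iter=20):
    """Alternate between (i) a standard SVM dual solve on the combined kernel
    with the weights theta fixed and (ii) the closed-form, KL-based update of
    theta given in Eq. (25) of the appendix."""
    y = np.asarray(y, dtype=float)
    q = np.asarray(q, dtype=float)
    n = y.shape[0]
    theta = q.copy()                        # start from the ideal kernel weights
    alpha = np.zeros(n)
    for _ in range(n_iter):
        K = sum(t * Kk for t, Kk in zip(theta, kernels))
        svm = SVC(C=C, kernel='precomputed').fit(K, y)
        alpha = np.zeros(n)
        alpha[svm.support_] = np.abs(svm.dual_coef_).ravel()  # 0 <= alpha_i <= C
        # KL-based update: theta_k is proportional to q_k * exp(-quad_k / (2*gamma))
        ay = alpha * y
        quad = np.array([ay @ Kk @ ay for Kk in kernels])
        theta = q * np.exp(-quad / (2.0 * gamma))
        theta /= theta.sum()
    return alpha, theta
```

A numerically safer, stand-alone version of the weight update is sketched after the appendix derivation.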



Acknowledgments

This work is partially supported by the National Natural Science Foundation of P.R. China (Grant Nos. 61003169 and 61303182).

Author information

Corresponding author

Correspondence to Zhizheng Liang.

Appendix: The derivation of Eq. (13)

$$\begin{aligned}&\mathop {\max }\limits _{{\varvec{\alpha }},{\varvec{\theta }}} f({\varvec{\alpha }},{\varvec{\theta }}):=-\frac{1}{2}\sum _{k=1}^d {\theta _k } {\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}+{\varvec{\alpha }}^{T}{{\varvec{e}}}- \gamma \sum _{k=1}^d {\theta _k } \log \frac{\theta _k }{q_k }\nonumber \\&\hbox {s.t.} \sum _{i=1}^n {\alpha _i y_i =0} ,C\ge \alpha _i \ge 0,i=1,\ldots ,n,\nonumber \\&\quad \sum _{k=1}^d {\theta _k } =1,\theta _k \ge 0, \end{aligned}$$
(18)

From Eq. (18), for fixed \(\alpha _i (i=1,\ldots ,n)\), \(f({\varvec{\alpha }},{\varvec{\theta }})\) is strictly concave with respect to \(\theta \). Thus maximizing \(f({\varvec{\alpha }},{\varvec{\theta }})\) over the simplex yields a unique solution. We define the following partial Lagrangian function:

$$\begin{aligned} L({\varvec{\theta }}):=-\frac{1}{2}\sum _{k=1}^d {\theta _k } {\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}+{\varvec{\alpha }}^{T}{{\varvec{e}}}- \gamma \sum _{k=1}^d {\theta _k } \log \frac{\theta _k }{q_k }+\lambda \left( 1-\sum _{k=1}^d {\theta _k } \right) \nonumber \\ \end{aligned}$$
(19)

Setting the derivative of \(L({\varvec{\theta }})\) with respect to \(\theta _k \) to zero (the constant \(-\gamma \) produced by differentiating the entropy term is absorbed into the multiplier \(\lambda \)) gives

$$\begin{aligned} \frac{\partial L({\varvec{\theta }})}{\partial \theta _k }= -\frac{1}{2}{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}-\gamma \log \theta _k +\gamma \log q_k -\lambda =0. \end{aligned}$$
(20)
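As a quick symbolic check (a minimal SymPy sketch; the scalar a_k below is our shorthand for the quadratic form \({\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}\) and is not notation from the paper), differentiating the \(\theta _k \)-dependent part of Eq. (19) reproduces Eq. (20) up to the constant \(-\gamma \) that is absorbed into \(\lambda \):

```python
import sympy as sp

theta_k, q_k, a_k, gamma, lam = sp.symbols('theta_k q_k a_k gamma lambda', positive=True)

# theta_k-dependent part of the partial Lagrangian in Eq. (19);
# a_k is shorthand for alpha^T diag(y) K_k diag(y) alpha (a fixed scalar here).
L_k = -sp.Rational(1, 2) * a_k * theta_k \
      - gamma * theta_k * sp.log(theta_k / q_k) \
      + lam * (1 - theta_k)

# Prints an expression equivalent to -a_k/2 - gamma*log(theta_k/q_k) - gamma - lambda,
# i.e. Eq. (20) once the constant -gamma is folded into lambda.
print(sp.diff(L_k, theta_k))
```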

From Eq. (20), one has

$$\begin{aligned} -\frac{1}{2}{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}-\lambda =\gamma \log \frac{\theta _k }{q_k } \end{aligned}$$
(21)

From Eq. (21), one has

$$\begin{aligned} q_k e^{-\frac{1}{2\gamma }{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}-\frac{\lambda }{\gamma }}=\theta _k . \end{aligned}$$
(22)

Since \(\sum _{k=1}^d {\theta _k } =1\), we have

$$\begin{aligned}&\sum _{k=1}^d {q_k } e^{-\frac{1}{2\gamma }{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}-\frac{\lambda }{\gamma }}=\sum _{k=1}^d {\theta _k } =1.\end{aligned}$$
(23)
$$\begin{aligned}&e^{-\frac{\lambda }{\gamma }}\sum _{k=1}^d {q_k } e^{-\frac{1}{2\gamma }{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}}=1. \end{aligned}$$
(24)

Solving Eq. (24) for \(e^{-\lambda /\gamma }\) and substituting it into Eq. (22), one has

$$\begin{aligned} \theta _k =\frac{q_k \, e^{-\frac{\sum _{i,j} \alpha _i \alpha _j y_i y_j K_k (\mathbf{x}_i ,\mathbf{x}_j )}{2\gamma }}}{\sum _{s=1}^d q_s \, e^{-\frac{\sum _{i,j} \alpha _i \alpha _j y_i y_j K_s (\mathbf{x}_i ,\mathbf{x}_j )}{2\gamma }}}, \quad k=1,\ldots ,d. \end{aligned}$$
(25)

From Eq. (25), one can see that the non-negativity of \({\varvec{\theta }}\) is automatically guaranteed, since each \(\theta _k \) is a ratio of positive quantities whenever \(q_k >0\).
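For reference, here is a minimal NumPy sketch of this closed-form update; the function name update_kernel_weights, the argument layout, and the log-space (softmax-style) normalization used for numerical stability are our own choices and are not taken from the paper.

```python
import numpy as np

def update_kernel_weights(alpha, y, kernels, q, gamma):
    """Closed-form theta update of Eq. (25) for fixed dual variables alpha.

    alpha   : (n,) SVM dual variables
    y       : (n,) labels in {-1, +1}
    kernels : list of d base kernel matrices K_k, each (n, n)
    q       : (d,) ideal (prior) kernel weights, q_k > 0
    gamma   : regularization parameter of the KL term
    """
    q = np.asarray(q, dtype=float)
    ay = np.asarray(alpha, dtype=float) * np.asarray(y, dtype=float)
    # quadratic forms alpha^T diag(y) K_k diag(y) alpha, one per base kernel
    quad = np.array([ay @ K @ ay for K in kernels])
    # theta_k is proportional to q_k * exp(-quad_k / (2*gamma)); work in
    # log-space (softmax trick) so a small gamma does not underflow to zero
    logits = np.log(q) - quad / (2.0 * gamma)
    logits -= logits.max()
    theta = np.exp(logits)
    return theta / theta.sum()
```

Because \(\theta _k \propto q_k e^{-{\varvec{\alpha }}^{T}diag({{\varvec{y}}})K_k diag({{\varvec{y}}}){\varvec{\alpha }}/(2\gamma )}\), the update tends to the ideal weights \(q\) as \(\gamma \) grows, which matches one end of the solution path mentioned in the abstract.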

Cite this article

Liang, Z., Zhang, L. & Liu, J. A Novel Multiple Kernel Learning Method Based on the Kullback–Leibler Divergence. Neural Process Lett 42, 745–762 (2015). https://doi.org/10.1007/s11063-014-9392-3
