
A survey of the state of the art in learning the kernels

  • Survey Paper
  • Published in: Knowledge and Information Systems

Abstract

In recent years, the machine learning community has witnessed tremendous growth in the development of kernel-based learning algorithms. However, the performance of this class of algorithms depends heavily on the choice of the kernel function. A kernel function implicitly represents the inner product between a pair of points of a dataset in a higher-dimensional space. This inner product amounts to a similarity between points and provides the foundation for nonlinear analysis in kernel-based learning algorithms. The most important challenge in kernel-based learning is therefore the selection of an appropriate kernel for a given dataset. To address this problem, algorithms that learn the kernel have recently been proposed; these methods formulate a learning problem whose solution is an optimal kernel for a given dataset. In this paper, we present an overview of these algorithms and compare the various approaches to finding an optimal kernel. Furthermore, we identify a set of pivotal issues that lead to the efficient design of such algorithms.
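As a brief illustration of the role of the kernel function (a sketch added here, not part of the original article): the Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) implicitly corresponds to an inner product between feature-space images of x and y, and its bandwidth sigma is exactly the kind of kernel parameter that kernel-learning algorithms aim to select from data. The following minimal Python example (NumPy only; the dataset and the value of sigma are hypothetical) computes such a kernel matrix:

import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Gram matrix with entries K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Hypothetical two-dimensional dataset; sigma is the kernel parameter that
# kernel-learning methods try to choose in a data-driven way.
X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
K = rbf_kernel_matrix(X, sigma=1.0)
print(K)  # entries near 1 indicate similar points, entries near 0 dissimilar ones

The methods surveyed in the paper replace such a fixed, hand-picked choice (of sigma, or of the kernel family itself) with an optimization over kernels, for instance over combinations of several candidate base kernels.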



Author information

Corresponding author

Correspondence to M. Ehsan Abbasnejad.


About this article

Cite this article

Abbasnejad, M.E., Ramachandram, D. & Mandava, R. A survey of the state of the art in learning the kernels. Knowl Inf Syst 31, 193–221 (2012). https://doi.org/10.1007/s10115-011-0404-6

