Neurocomputing
Volume 48, Issues 1–4, October 2002, Pages 63–84

Kernel methods: a survey of current techniques

https://doi.org/10.1016/S0925-2312(01)00643-9

Abstract

Kernel methods have become an increasingly popular tool for machine learning tasks such as classification, regression and novelty detection. They exhibit good generalization performance on many real-life datasets, have few free parameters to adjust, and do not require the architecture of the learning machine to be found by experimentation. In this tutorial, we survey this subject with a principal focus on the most well-known models based on kernel substitution, namely, support vector machines.

Introduction

Support vector machines (SVMs) have been successfully applied to a number of applications ranging from particle identification, face identification and text categorization to engine-knock detection, bioinformatics and database marketing [17]. The approach is systematic and properly motivated by statistical learning theory [58]. Training involves optimization of a convex cost function: there are no local minima to complicate the learning process. The approach has many other benefits, for example, the model constructed has an explicit dependence on a subset of the datapoints (the support vectors), hence interpretation is straightforward and data cleaning [16] could be implemented to improve performance. SVMs are the most well known of a class of algorithms which use the idea of kernel substitution and which we will broadly refer to as kernel methods.
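
As an illustrative aside (not part of the original text), the short Python sketch below makes the idea of kernel substitution concrete; the Gaussian RBF kernel, its width gamma and the toy data are arbitrary assumptions. It builds the kernel (Gram) matrix on which the methods discussed here operate and checks that it is positive semi-definite, as Mercer's condition requires.

import numpy as np

def rbf_kernel(X, gamma=0.5):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2); gamma is an assumed kernel width.
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))  # toy data: 40 points in 5 dimensions
K = rbf_kernel(X)

# Mercer's condition: the Gram matrix must be positive semi-definite (up to rounding error).
print("smallest eigenvalue of K:", np.linalg.eigvalsh(K).min())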

In this tutorial, we introduce this subject, describing the application of kernel methods to classification, regression and novelty detection and the different optimization techniques that may be used during training. This tutorial is not exhaustive and many alternative kernel-based approaches (e.g., kernel PCA [42], density estimation [64], etc.) have not been considered here. More thorough treatments are contained in the books by Cristianini and Shawe-Taylor [11], Vapnik's classic textbook on statistical learning theory [58], recent edited volumes [41], [49] and a special issue of Machine Learning [9].

Section snippets

Learning with support vectors

To introduce the subject we will begin by outlining the application of SVMs to the simplest case of binary classification. From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error [58], [11] (the theoretical generalization performance on new data). These generalization bounds have two important features (Appendix A). Firstly, the upper bound on the generalization error does not depend on the dimensionality of the feature space.
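
As a hedged illustration of the binary classification setting just described (not the paper's own experiments; scikit-learn, the RBF kernel and the parameter values C and gamma are assumptions), the sketch below trains a soft-margin SVM on toy data and shows that the resulting decision function depends only on a subset of the datapoints, the support vectors.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy binary-labelled data: two Gaussian clusters.
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Soft-margin SVM; C trades margin maximization against training errors.
clf = SVC(kernel="rbf", C=1.0, gamma=0.5)
clf.fit(X, y)

print("support vectors used:", len(clf.support_))
print("training accuracy:", clf.score(X, y))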

Regression

For real-valued outputs the learning task can also be theoretically motivated from statistical learning theory (Appendix A). Instead of (3) we now use the constraints y_i − w·x_i − b ⩽ ε and w·x_i + b − y_i ⩽ ε to allow for some deviation ε between the eventual targets y_i and the function f(x) = w·x + b modelling the data. We can visualize this as a band or tube of size ±ε around the hypothesis function f(x), and any points outside this tube can be viewed as training errors. The structure of the tube is defined by an ε-insensitive loss function.
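
A minimal sketch of this ε-insensitive regression, again as an added illustration rather than the paper's formulation (scikit-learn's SVR, the sine toy data and the values of C, epsilon and gamma are assumptions): points whose residual stays within the ±ε tube contribute no loss, and only points on or outside the tube become support vectors.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(200)  # noisy targets

# epsilon is the half-width of the insensitive tube around f(x).
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
model.fit(x, y)

outside = np.abs(model.predict(x) - y) > model.epsilon
print("fraction of training points outside the tube:", outside.mean())
print("number of support vectors:", len(model.support_))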

Algorithmic approaches

So far the methods we have considered have involved linear or quadratic programming. Linear programming can be implemented using column generation techniques [32] and many packages are available, e.g., CPLEX. Existing LP packages based on simplex or interior point methods can handle problems of moderate size (up to thousands of datapoints). For quadratic programming there are also many applicable techniques, including conjugate gradient and primal-dual interior point methods [26]. Certain QP packages are also readily available.
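
One of the simplest iterative alternatives to a general-purpose QP solver is a kernel-Adatron-style update of the dual variables. The sketch below is a toy illustration under stated assumptions (RBF kernel, no bias term, fixed learning rate and iteration count); it is not an optimized solver and not necessarily the exact procedure described in the references.

import numpy as np

def rbf_kernel(X, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def kernel_adatron(K, y, C=10.0, lr=0.01, n_iter=1000):
    # Increase alpha_i while the margin of point i is below 1,
    # clipping to the box constraints 0 <= alpha_i <= C.
    alpha = np.zeros(len(y))
    for _ in range(n_iter):
        margins = y * (K @ (alpha * y))
        alpha = np.clip(alpha + lr * (1.0 - margins), 0.0, C)
    return alpha

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 0.5, size=(30, 2)),
               rng.normal(+1.0, 0.5, size=(30, 2))])
y = np.array([-1.0] * 30 + [1.0] * 30)

alpha = kernel_adatron(rbf_kernel(X), y)
print("support vectors (alpha > 1e-6):", int(np.sum(alpha > 1e-6)))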

Further techniques based on kernel representations

So far we have considered methods based on linear and quadratic programming. Here we shall consider further kernel-based approaches which may utilize general non-linear programming and other techniques. In particular, we will consider approaches to two issues: how to improve generalization performance over standard SVMs and how to create hypotheses which are sparse.
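
For the sparsity issue, one standard route (offered here as a generic sketch, not necessarily the paper's exact formulation) is to keep the kernel expansion but replace the quadratic objective with the sum of the nonnegative expansion coefficients plus slack penalties, which yields a linear program and typically drives most coefficients to zero. The kernel, the toy data, the value of C and the use of scipy's linprog are assumptions.

import numpy as np
from scipy.optimize import linprog

def rbf_kernel(X, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 0.6, size=(25, 2)),
               rng.normal(+1.0, 0.6, size=(25, 2))])
y = np.array([-1.0] * 25 + [1.0] * 25)
n, C = len(y), 10.0
K = rbf_kernel(X)

# Variables: [alpha (n), b, xi (n)]; minimize sum(alpha) + C * sum(xi).
c = np.concatenate([np.ones(n), [0.0], C * np.ones(n)])
# Margin constraints y_i (sum_j alpha_j y_j K_ij + b) >= 1 - xi_i, written as A_ub x <= b_ub.
A_ub = np.hstack([-(y[:, None] * K * y[None, :]), -y[:, None], -np.eye(n)])
b_ub = -np.ones(n)
bounds = [(0, None)] * n + [(None, None)] + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
alpha = res.x[:n]
print("nonzero expansion coefficients:", int(np.sum(alpha > 1e-6)), "of", n)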

Algorithms leading to dense hypotheses. Taking the geometric dual of input space, we find that datapoints become hyperplanes and candidate hypotheses become points.

Conclusion

The approach we have considered is very general in that it can be applied to a wide range of machine learning tasks and can be used to generate many possible learning machine architectures (RBF networks, feedforward neural networks) through an appropriate choice of kernel. A variety of optimization techniques can be used during the training process, which typically involves optimization of a convex function. Above all, kernel methods have been found to work well in practice. The subject is still developing rapidly.

References (66)

  • M. Anthony et al., Learning in Neural Networks: Theoretical Foundations (1999)
  • P. Bradley, O. Mangasarian, D. Musicant, Optimization in massive datasets, in: J. Abello, P. Pardalos, M. Resende...
  • C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery (1998)
  • C. Campbell, K.P. Bennett, A linear programming approach to novelty detection, Advances in Neural Information...
  • C. Campbell, N. Cristianini, Simple training algorithms for support vector machines, Technical Report, Bristol...
  • O. Chapelle et al., Model selection for support vector machines, to appear in Advances in Neural Information Processing Systems, vol. 12 (2000)
  • R. Collobert, S. Bengio, SVMTorch web page:...
  • C. Cortes et al., Support vector networks, Machine Learning (1995)
  • N. Cristianini, C. Campbell, C. Burges (Eds.), Support vector machines and kernel methods, Machine Learning, 2001, to...
  • N. Cristianini, C. Campbell, J. Shawe-Taylor, Dynamically adapting kernels in support vector machines, Advances in...
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods (2000)
  • R.O. Duda et al., Pattern Classification and Scene Analysis (1973)
  • M. Ferris, T. Munson, Interior point methods for massive support vector machines, Data Mining Institute Technical...
  • M. Ferris, T. Munson, Semi-smooth support vector machines, Data Mining Institute Technical Report 00-09, Computer...
  • T.-T. Friess, N. Cristianini, C. Campbell, The kernel adatron algorithm: a fast and simple learning procedure for...
  • I. Guyon et al., Discovering informative patterns and data cleaning
  • Cf:...
  • D. Haussler, Convolution kernels on discrete structures, UC Santa Cruz Technical Report UCS-CRL-99-10,...
  • R. Herbrich, T. Graepel, C. Campbell, Bayes point machines, J. Machine Learning Res. (2001), to...
  • R. Herbrich, T. Graepel, C. Campbell, Robust Bayes point machines, Proceedings of ESANN2000, D-Facto Publications,...
  • T. Jaakkola, D. Haussler, Probabilistic kernel regression models, Proceedings of the 1999 Conference on AI and...
  • T. Joachims, Estimating the generalization performance of an SVM efficiently, Proceedings of the Seventeenth...
  • T. Joachims, Web page for SVMLight software:...
  • S. Keerthi, S. Shevade, C. Bhattacharyya, K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design,...
  • S. Keerthi et al., A fast iterative nearest point algorithm for support vector machine classifier design, IEEE Trans. Neural Networks (2000)
  • D. Luenberger, Linear and Nonlinear Programming (1984)
  • O.L. Mangasarian, Linear and non-linear separation of patterns by linear programming, Oper. Res. (1965)
  • O. Mangasarian, D. Musicant, Lagrangian support vector regression, Data Mining Institute Technical Report 00-06, June...
  • E. Mayoraz, E. Alpaydin, Support vector machines for multiclass classification, Proceedings of the International...
  • J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philos. Trans. Roy. Soc. London (1909)
  • S. Mika et al., Fisher discriminant analysis with kernels, Proceedings of IEEE Neural Networks for Signal Processing Workshop (1999)
  • S. Nash et al., Linear and Nonlinear Programming (1996)
  • M. Opper et al., Generalization performance of Bayes optimal classification algorithm for learning a perceptron, Phys. Rev. Lett. (1991)