Kernel methods: a survey of current techniques
Introduction
Support vector machines (SVMs) have been successfully applied to tasks ranging from particle identification, face identification and text categorization to engine-knock detection, bioinformatics and database marketing [17]. The approach is systematic and well motivated by statistical learning theory [58], and training involves optimization of a convex cost function, so there are no local minima to complicate the learning process. The approach has other benefits: for example, the model constructed has an explicit dependence on a subset of the datapoints (the support vectors), so interpretation is straightforward and data cleaning [16] can be used to improve performance. SVMs are the best known of a class of algorithms that use the idea of kernel substitution, which we will broadly refer to as kernel methods.
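To make the kernel substitution idea concrete: any algorithm that can be written purely in terms of dot products $\mathbf{x}_i \cdot \mathbf{x}_j$ can be made non-linear by replacing those dot products with kernel evaluations $K(\mathbf{x}_i, \mathbf{x}_j)$. Below is a minimal numpy sketch using a Gaussian kernel; the function name rbf_kernel and the width parameter gamma are illustrative choices, not notation from this survey.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X = np.random.randn(5, 2)
K = rbf_kernel(X, X)
print(K.shape)              # (5, 5): one entry per pair of datapoints
print(np.allclose(K, K.T))  # True: a Mercer kernel matrix is symmetric (and positive semi-definite)
```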
In this tutorial, we introduce this subject, describing the application of kernel methods to classification, regression and novelty detection, and the different optimization techniques that may be used during training. This tutorial is not exhaustive, and many alternative kernel-based approaches (e.g. kernel PCA [42] and density estimation [64]) are not considered here. More thorough treatments can be found in the book by Cristianini and Shawe-Taylor [11], Vapnik's classic textbook on statistical learning theory [58], recent edited volumes [41], [49] and a special issue of Machine Learning [9].
Learning with support vectors
To introduce the subject we will begin by outlining the application of SVMs to the simplest case of binary classification. From the perspective of statistical learning theory the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error [58], [11] (the theoretical generalization performance on new data). These generalization bounds have two important features (Appendix A). Firstly, the upper bound on the generalization error does not depend on the dimension of the space in which the data lie. Secondly, the bound is minimized by maximizing the margin, i.e. the minimal distance between the separating hyperplane and the closest datapoints.
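To make the training step concrete, here is a minimal sketch of maximal-margin learning via gradient ascent on the SVM dual objective, in the spirit of the simple training schemes of Campbell and Cristianini (see the references). It assumes separable data and a bias-free hard-margin formulation; all function and variable names are illustrative.

```python
import numpy as np

def kernel_adatron(K, y, lr=0.01, epochs=200):
    """Gradient ascent on the bias-free hard-margin SVM dual:
    maximize sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij,
    subject to alpha_i >= 0 (enforced by clipping)."""
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            # functional margin of point i under the current multipliers
            z_i = np.sum(alpha * y * K[i])
            alpha[i] = max(0.0, alpha[i] + lr * (1.0 - y[i] * z_i))
    return alpha

# Toy separable problem with a linear kernel K_ij = x_i . x_j
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = kernel_adatron(X @ X.T, y)
print(np.sign((X @ X.T) @ (alpha * y)))  # recovers y; alpha_i > 0 marks the support vectors
```

Because the dual objective is concave, such coordinate-wise ascent converges to the global maximum; only the datapoints closest to the decision boundary end up with non-zero multipliers.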
Regression
For real-valued outputs the learning task can also be theoretically motivated from statistical learning theory (Appendix A). Instead of (3) we now use constraints $y_i - \mathbf{w} \cdot \mathbf{x}_i - b \le \varepsilon$ and $\mathbf{w} \cdot \mathbf{x}_i + b - y_i \le \varepsilon$ to allow for some deviation $\varepsilon$ between the eventual targets $y_i$ and the function $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$ modelling the data. We can visualize this as a band or tube of size $\pm\varepsilon$ around the hypothesis function $f(\mathbf{x})$, and any points outside this tube can be viewed as training errors. The structure of the tube is defined by an $\varepsilon$-insensitive loss function.
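As an illustration of the $\varepsilon$-insensitive idea, the sketch below fits a linear model by sub-gradient descent on the primal loss $\sum_i \max(0, |\mathbf{w} \cdot \mathbf{x}_i + b - y_i| - \varepsilon)$ plus a quadratic regularizer; points inside the $\pm\varepsilon$ tube contribute nothing to the gradient. This is a simplification for exposition (the formulation above is a constrained QP), and all names and parameter values are illustrative.

```python
import numpy as np

def linear_svr(X, y, eps=0.1, lam=0.01, lr=0.05, epochs=1000):
    """Sub-gradient descent on the primal epsilon-insensitive regression loss:
    mean_i max(0, |w.x_i + b - y_i| - eps) + 0.5 * lam * ||w||^2."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        r = X @ w + b - y                                # residuals
        g = np.where(np.abs(r) > eps, np.sign(r), 0.0)   # only points outside the tube
        w -= lr * (X.T @ g / len(y) + lam * w)
        b -= lr * g.mean()
    return w, b

# Noisy linear data: most points fall inside the +/- eps tube and incur no loss
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
y = 2.0 * X[:, 0] + 0.05 * rng.standard_normal(50)
w, b = linear_svr(X, y)
print(np.round(w, 2), np.round(b, 2))  # roughly [2.0] and 0.0
```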
Algorithmic approaches
So far the methods we have considered have involved linear or quadratic programming. Linear programming can be implemented using column generation techniques [32] and many packages are available, e.g. CPLEX. Existing LP packages based on simplex or interior point methods can handle problems of moderate size (up to thousands of datapoints). For quadratic programming there are also many applicable techniques including conjugate gradient and primal-dual interior point methods [26]. Certain QP routines require storage of the full kernel matrix, however, so for larger datasets specialized working-set techniques such as chunking, decomposition and sequential minimal optimization (SMO) have been developed.
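For problems of moderate size, the SVM dual can simply be handed to a general-purpose constrained optimizer. As a sketch (an illustration of the QP route, not a scalable implementation), the code below uses SciPy's SLSQP method to solve the soft-margin dual: maximize $\sum_i \alpha_i - \tfrac{1}{2}\boldsymbol{\alpha}^{\top} Q \boldsymbol{\alpha}$ with $Q_{ij} = y_i y_j K_{ij}$, subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_qp(K, y, C=1.0):
    """Solve the soft-margin SVM dual with a general-purpose solver (SLSQP)."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K
    fun = lambda a: 0.5 * a @ Q @ a - a.sum()   # negated dual objective (we minimize)
    jac = lambda a: Q @ a - np.ones(n)
    cons = [{'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y}]
    res = minimize(fun, np.zeros(n), jac=jac, bounds=[(0.0, C)] * n,
                   constraints=cons, method='SLSQP')
    return res.x

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = svm_dual_qp(X @ X.T, y)   # linear kernel
print(np.round(alpha, 3))         # non-zero entries mark the support vectors
```

General-purpose solvers like this scale poorly because the $n \times n$ kernel matrix must be formed and stored, which is exactly what motivates the working-set methods mentioned above.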
Further techniques based on kernel representations
So far we have considered methods based on linear and quadratic programming. Here we shall consider further kernel-based approaches which may utilize general non-linear programming and other techniques. In particular, we will consider approaches to two issues: how to improve generalization performance over standard SVMs and how to create hypotheses which are sparse.
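One common route to sparse hypotheses, sketched below, is to penalize the 1-norm of the expansion coefficients, which can be cast as a linear program: minimize $\sum_j |\alpha_j| + C \sum_i \xi_i$ subject to $y_i (\sum_j \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$. This is an assumed illustrative formulation in the LP-machine spirit, not necessarily one of the specific algorithms surveyed here; the point is that the 1-norm penalty drives many coefficients exactly to zero.

```python
import numpy as np
from scipy.optimize import linprog

def lp_machine(K, y, C=1.0):
    """1-norm ('LP machine') classifier. Splitting alpha = ap - am and
    b = bp - bm keeps all variables non-negative for the LP solver."""
    n = len(y)
    # Variable order: [ap (n), am (n), bp, bm, xi (n)]
    c = np.concatenate([np.ones(2 * n), [0.0, 0.0], C * np.ones(n)])
    YK = y[:, None] * K
    # y_i (K_i.(ap - am) + bp - bm) + xi_i >= 1, written as A_ub x <= b_ub
    A_ub = np.hstack([-YK, YK, -y[:, None], y[:, None], -np.eye(n)])
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(n), bounds=[(0, None)] * (3 * n + 2))
    alpha = res.x[:n] - res.x[n:2 * n]
    b = res.x[2 * n] - res.x[2 * n + 1]
    return alpha, b

X = np.array([[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, b = lp_machine(X @ X.T, y)
print(np.round(alpha, 3), round(b, 3))  # most alpha_j are exactly zero
```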
Algorithms leading to dense hypotheses. Taking the geometric dual of input space we find datapoints become hyperplanes and candidate hypotheses become points, so the hypotheses consistent with the training data form a region (the version space) bounded by these hyperplanes.
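A representative kernel method producing dense hypotheses is the kernel Fisher discriminant (cf. the Mika et al. entry in the references), in which the expansion coefficients solve a regularized linear system rather than a margin QP, so they are typically all non-zero. A minimal sketch, with the regularizer mu and all names chosen for illustration:

```python
import numpy as np

def kernel_fisher(K, y, mu=1e-3):
    """Kernel Fisher discriminant: alpha = (N + mu*I)^{-1} (M+ - M-),
    where M+/- are the mean kernel columns of each class and N is the
    within-class scatter expressed through the kernel matrix."""
    n = len(y)
    M_diff = K[:, y > 0].mean(axis=1) - K[:, y < 0].mean(axis=1)
    N = np.zeros((n, n))
    for cls in (y > 0, y < 0):
        Kc, nc = K[:, cls], int(cls.sum())
        N += Kc @ (np.eye(nc) - np.ones((nc, nc)) / nc) @ Kc.T
    return np.linalg.solve(N + mu * np.eye(n), M_diff)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1.5, 1.5], 1.0, (20, 2)),
               rng.normal([-1.5, -1.5], 1.0, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
K = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gaussian kernel
alpha = kernel_fisher(K, y)
p = K @ alpha                                  # projections of the training points
b = 0.5 * (p[y > 0].mean() + p[y < 0].mean())  # threshold between projected class means
print(np.mean(np.sign(p - b) == y))            # training accuracy; alpha is dense
```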
Conclusion
The approach we have considered is very general in that it can be applied to a wide range of machine learning tasks and can be used to generate many possible learning machine architectures (RBF networks, feedforward neural networks) through an appropriate choice of kernel. A variety of optimization techniques can be used during the training process, which typically involves optimization of a convex function. Above all, kernel methods have been found to work well in practice. The subject is still developing rapidly, and new kernel-based methods and application areas continue to appear.
References
- M. Anthony, P. Bartlett, Learning in neural networks: theoretical foundations, 1999.
- P. Bradley, O. Mangasarian, D. Musicant, Optimization in massive datasets, in: J. Abello, P. Pardalos, M. Resende...
- C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 1998.
- C. Campbell, K.P. Bennett, A linear programming approach to novelty detection, Advances in Neural Information...
- C. Campbell, N. Cristianini, Simple training algorithms for support vector machines, Technical Report, Bristol...
- O. Chapelle, V. Vapnik, Model selection for support vector machines, Advances in Neural Information Processing Systems, vol. 12, 2000.
- R. Collobert, S. Bengio, SVMTorch web page: ...
- C. Cortes, V. Vapnik, Support vector networks, Machine Learning, 1995.
- N. Cristianini, C. Campbell, C. Burges (Eds.), Support vector machines and kernel methods, Machine Learning, 2001, to appear.
- N. Cristianini, C. Campbell, J. Shawe-Taylor, Dynamically adapting kernels in support vector machines, Advances in...
- N. Cristianini, J. Shawe-Taylor, An introduction to support vector machines and other kernel-based learning methods, Cambridge University Press, 2000.
- R.O. Duda, P.E. Hart, Pattern classification and scene analysis, Wiley, 1973.
- I. Guyon, N. Matić, V. Vapnik, Discovering informative patterns and data cleaning.
- S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy, A fast iterative nearest point algorithm for support vector machine classifier design, IEEE Trans. Neural Networks, 2000.
- D.G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley.
- O.L. Mangasarian, Linear and non-linear separation of patterns by linear programming, Oper. Res., 1965.
- J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations, Philos. Trans. Roy. Soc. London, 1909.
- S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.-R. Müller, Fisher discriminant analysis with kernels, Proceedings of the IEEE Neural Networks for Signal Processing Workshop, 1999.
- Linear and non-linear programming.
- M. Opper, D. Haussler, Generalization performance of Bayes optimal classification algorithm for learning a perceptron, Phys. Rev. Lett., 1991.