Optimally regularised kernel Fisher discriminant classification
Introduction
In recent years the “kernel trick” has been applied to the construction of non-linear equivalents of a wide range of classical linear statistical models, for instance ridge regression (Hoerl & Kennard, 1970; Saunders et al., 1998), principal component analysis (Jolliffe, 2002; Schölkopf et al., 1999) and Fisher’s linear discriminant (Fisher, 1936; Mika et al., 1999), in addition to more modern techniques such as the maximal margin classifier (Boser et al., 1992; Cortes & Vapnik, 1995) (for an introduction to kernel learning methods, see Schölkopf and Smola (2002) or Shawe-Taylor and Cristianini (2004)). An important advantage of kernel models is that the model parameters are typically given by the solution of a convex optimisation problem with a single, global optimum (Boyd & Vandenberghe, 2004). The generalisation properties of kernel models are, however, typically governed by a small number of regularisation and kernel parameters, and good values for these parameters must be determined during the model selection process. Since there is generally no guarantee that the model selection criterion is unimodal, simple grid-based search procedures are often employed in practical applications. In this paper, we propose a simple and computationally efficient method for choosing the regularisation parameter in kernel Fisher discriminant analysis so as to minimise an approximation to the leave-one-out cross-validation error. The resulting optimally regularised kernel Fisher discriminant (ORKFD) analysis algorithm is then attractive for small- to medium-scale applications (currently anything less than a few thousand training patterns), as the algorithm is easily implemented (only 15 lines of code in the MATLAB programming environment) and inherently resistant to over-fitting.
The remainder of this paper is structured as follows: Section 2 reviews the kernel Fisher discriminant classifier and introduces the notation used throughout. Section 3 then proposes an efficient algorithm for selecting the regularisation parameter of a KFD classifier, so as to minimise the leave-one-out cross-validation error, with a computational complexity of only $\mathcal{O}(\ell^2)$ operations per candidate value instead of the $\mathcal{O}(\ell^3)$ operations of direct methods (Cawley & Talbot, 2003b), where $\ell$ is the number of training patterns. Section 4 presents results obtained on a range of real-world benchmark datasets. The extension of this approach to closely related forms of least-squares kernel learning is discussed in Section 5. Finally, the work is summarised in Section 6.
Section snippets
The kernel Fisher discriminant classifier
Assume we are given training data $\mathcal{D} = \mathcal{X}_1 \cup \mathcal{X}_2$, where $\mathcal{X}_1 = \{\mathbf{x}_i^1\}_{i=1}^{\ell_1}$ is a set of patterns belonging to class $\mathcal{C}_1$ and similarly $\mathcal{X}_2 = \{\mathbf{x}_i^2\}_{i=1}^{\ell_2}$ is a set of patterns belonging to class $\mathcal{C}_2$; Fisher’s linear discriminant (FLD) attempts to find a linear combination of input variables, $y(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$, that maximises the average separation of the projections of points belonging to $\mathcal{C}_1$ and $\mathcal{C}_2$, whilst minimising the within-class variance of the projections of those points. The Fisher …
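For reference, the criterion being maximised can be written in the standard textbook form (the notation below, with between-class and within-class scatter matrices $\mathbf{S}_B$ and $\mathbf{S}_W$ and class means $\mathbf{m}_1$ and $\mathbf{m}_2$, is the usual one and is not taken verbatim from the paper):

\[
J(\mathbf{w}) \;=\; \frac{\mathbf{w}^{\top} \mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\top} \mathbf{S}_W \mathbf{w}},
\qquad
\mathbf{S}_B = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^{\top},
\qquad
\mathbf{S}_W = \sum_{c \in \{1,2\}} \sum_{\mathbf{x} \in \mathcal{X}_c} (\mathbf{x} - \mathbf{m}_c)(\mathbf{x} - \mathbf{m}_c)^{\top}.
\]

In the kernel variant, the discriminant direction is expanded over the training patterns in a kernel-induced feature space, and a regularisation term is added to the within-class scatter to prevent over-fitting; it is the choice of the coefficient of this term that the present paper addresses.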
Method
In this section, we describe a training algorithm for the kernel Fisher discriminant classifier in which the system of linear equations (4) is solved in canonical form. This allows the model parameters to be updated following a change in the value of the regularisation parameter with a computational complexity of only $\mathcal{O}(\ell^2)$ operations. This also permits the extension of an existing analytic method (Cawley & Talbot, 2003b) for re-evaluation of the leave-one-out cross-validation error in only …
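The remainder of this section is not excerpted here, but the central computational idea can be illustrated with a short listing. The following is a minimal sketch, not the authors' 15-line routine: it assumes a simplified least-squares system of the form $(\mathbf{K} + \lambda \mathbf{I})\boldsymbol{\alpha} = \mathbf{t}$ (the bias term of Eq. (4) is omitted), an RBF kernel, and hypothetical variable names X, t and gamma. The kernel matrix is eigendecomposed once, at a cost of $\mathcal{O}(\ell^3)$ operations; each candidate value of the regularisation parameter then requires only $\mathcal{O}(\ell^2)$ operations to recompute the model and its leave-one-out (PRESS) error.

% Minimal sketch (not the authors' implementation): selecting the regularisation
% parameter lambda for a least-squares kernel discriminant of the simplified form
% (K + lambda*I)*alpha = t, using a single eigendecomposition of K.
% Assumed inputs: X (ell-by-d pattern matrix), t (ell-by-1 targets, e.g. +/-1),
% gamma (assumed RBF kernel parameter).
D2 = sum(X.^2,2) + sum(X.^2,2)' - 2*(X*X');    % squared Euclidean distances
K  = exp(-gamma*D2);                           % RBF kernel matrix (an assumption)
[V, D] = eig((K + K')/2);                      % eigendecomposition, O(ell^3), performed once
d  = diag(D);
Vt = V'*t;                                     % targets projected onto the eigenbasis
lambdas = 2.^(-20:0.5:5);                      % candidate regularisation parameters
press   = zeros(size(lambdas));
for i = 1:numel(lambdas)
    g        = d./(d + lambdas(i));            % eigenvalues of the hat matrix H = K*(K + lambda*I)^(-1)
    yhat     = V*(g.*Vt);                      % fitted outputs, O(ell^2) per candidate
    h        = (V.^2)*g;                       % diag(H), computed without forming H explicitly
    press(i) = mean(((t - yhat)./(1 - h)).^2); % leave-one-out (PRESS) mean squared error
end
[~, best] = min(press);                        % lambda minimising the leave-one-out criterion
alpha = V*(Vt./(d + lambdas(best)));           % model parameters for the selected lambda

The leave-one-out residuals above use the standard PRESS identity $e_i = (t_i - \hat{y}_i)/(1 - h_{ii})$ for linear smoothers; the formulation in the paper additionally accounts for the bias term and the KFD target coding.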
Results
The runtime for model selection based on the proposed optimally regularised kernel Fisher discriminant classifier is evaluated over a series of randomly generated synthetic datasets. In each case, approximately one quarter of the data belong to one class and three-quarters to the other; the patterns comprising one class are drawn from a bivariate Normal distribution with zero mean and unit variance, whilst the patterns forming the other class form an annulus, the radii of which are drawn from a normal …
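As the excerpt is cut off before the parameters of the radius distribution are given, the listing below only illustrates the kind of generator described; the total sample size, the mean and standard deviation of the annulus radii, the assignment of the Gaussian cluster to the minority class and the +/-1 target coding are all assumed values introduced for illustration.

% Illustrative generator for a synthetic dataset of the kind described above.
ell = 400;                              % total number of patterns (assumed)
n1  = round(ell/4);  n2 = ell - n1;     % roughly one quarter / three quarters
X1  = randn(n1, 2);                     % bivariate Normal, zero mean, unit variance
r     = 4 + 0.5*randn(n2, 1);           % annulus radii, mean 4, std 0.5 (assumed parameters)
theta = 2*pi*rand(n2, 1);               % angles uniform on [0, 2*pi)
X2  = [r.*cos(theta), r.*sin(theta)];   % annular class
X   = [X1; X2];                         % combined pattern matrix
t   = [ones(n1,1); -ones(n2,1)];        % +/-1 target coding (assumed)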
Optimal regularisation for related formulations
The same eigen-decomposition, or equivalently singular value decomposition, used to isolate the effect of the regularisation term can also be employed to develop optimally regularised variants of kernel ridge regression (Saunders et al., 1998) (also known as the regularization network (Poggio & Girosi, 1990) and regularised least squares (Rifkin, 2002)) and the least-squares support vector machine (Suykens & Vandewalle, 1999). Kernel ridge regression (Saunders et al., 1998) constructs a kernel …
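To make the mechanism explicit, consider kernel ridge regression in its simplest form, with the bias term omitted (standard notation, not the paper's exact formulation): the parameters solve

\[
\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{t},
\qquad
\mathbf{K} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^{\top}
\;\;\Longrightarrow\;\;
\boldsymbol{\alpha} = \mathbf{V} (\boldsymbol{\Lambda} + \lambda \mathbf{I})^{-1} \mathbf{V}^{\top} \mathbf{t}.
\]

Once the eigenvectors $\mathbf{V}$ and eigenvalues $\boldsymbol{\Lambda}$ of the kernel matrix have been computed, a change in $\lambda$ affects only the inversion of a diagonal matrix, so the parameters (and the corresponding leave-one-out statistics) can be updated in $\mathcal{O}(\ell^2)$ operations rather than re-solving the full $\ell \times \ell$ system.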
Summary
Model selection, the optimal choice of the values of a small number of regularisation and kernel parameters, is the key step in maximising generalisation performance using kernel learning methods. Conventional $k$-fold and leave-one-out cross-validation strategies provide computationally expensive, but highly effective solutions. In this paper, we extend an existing analytic method for efficient evaluation of the leave-one-out cross-validation error of a kernel Fisher discriminant classifier (Cawley & Talbot, 2003b) …
Acknowledgements
We thank the anonymous reviewers for their helpful and constructive comments. This work was supported by the Biotechnology and Biological Sciences Research Council (BBSRC) of the United Kingdom (grant number 83/EGM16128) and the Royal Society (grant number RSRG-22270).
References (37)
- Bay, S. D. (1999). The UCI KDD archive (http://kdd.ics.uci.edu/). Irvine, CA: University of California, Department of...
- Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers.
- Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
- Cawley, G. C., & Talbot, N. L. C. (2002). Improved sparse least-squares support vector machines. Neurocomputing.
- Cawley, G. C., & Talbot, N. L. C. (2002). A greedy training algorithm for sparse least-squares support vector machines.
- Cawley, G. C., & Talbot, N. L. C. (2003a). Efficient cross-validation of kernel Fisher discriminant classifiers. In...
- Cawley, G. C., & Talbot, N. L. C. (2003b). Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition.
- Cawley, G. C., & Talbot, N. L. C. (2007). Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters. Journal of Machine Learning Research.
- Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning.
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning.