
Neural Networks

Volume 20, Issue 7, September 2007, Pages 832-841

Optimally regularised kernel Fisher discriminant classification

https://doi.org/10.1016/j.neunet.2007.05.005

Abstract

Mika, Rätsch, Weston, Schölkopf and Müller [Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K.-R. (1999). Fisher discriminant analysis with kernels. In Neural networks for signal processing: Vol. IX (pp. 41–48). New York: IEEE Press] introduce a non-linear formulation of Fisher’s linear discriminant, based on the now familiar “kernel trick”, demonstrating state-of-the-art performance on a wide range of real-world benchmark datasets. In this paper, we extend an existing analytical expression for the leave-one-out cross-validation error [Cawley, G. C., & Talbot, N. L. C. (2003b). Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36(11), 2585–2592] such that the leave-one-out error can be re-estimated following a change in the value of the regularisation parameter with a computational complexity of only $O(\ell^2)$ operations, where $\ell$ is the number of training patterns; this is substantially less than the $O(\ell^3)$ operations required for the basic training algorithm. This allows the regularisation parameter to be tuned at an essentially negligible computational cost, and is achieved by performing the discriminant analysis in canonical form. The proposed method is therefore a useful component of a model selection strategy for this class of kernel machines that alternates between updates of the kernel and regularisation parameters. Results obtained on real-world and synthetic benchmark datasets indicate that the proposed method is competitive with model selection based on k-fold cross-validation in terms of generalisation, whilst being considerably faster.

Introduction

In recent years the “kernel trick” has been applied to constructing non-linear equivalents of a wide range of classical linear statistical models, for instance ridge regression (Hoerl and Kennard, 1970, Saunders et al., 1998), principal component analysis (Jolliffe, 2002, Schölkopf et al., 1999) and Fisher’s linear discriminant (Fisher, 1936, Mika et al., 1999), in addition to more modern techniques, such as the maximal margin classifier (Boser et al., 1992, Cortes and Vapnik, 1995) (for an introduction to kernel learning methods, see Schölkopf and Smola (2002) or Shawe-Taylor and Cristianini (2004)). An important advantage of kernel models is that the parameters of the model are typically given by the solution of a convex optimisation problem, with a single, global optimum (Boyd & Vandenberghe, 2004). The generalisation properties of kernel models are however typically governed by a small number of regularisation and kernel parameters. Good values for these parameters must be determined during the model selection process. There is generally no guarantee that the model selection criterion is unimodal, and so simple grid-based search procedures are often employed in practical applications. In this paper, we propose a simple and computationally efficient method for choosing the regularisation parameter in kernel Fisher discriminant analysis so as to minimise an approximation to the leave-one-out cross-validation error. The resulting optimally regularised kernel Fisher discriminant (ORKFD) analysis algorithm then becomes attractive for small to medium-scale applications (currently anything less than a few thousand training patterns) as the algorithm is easily implemented (only 15 lines of code in the MATLAB programming environment) and inherently resistant to over-fitting.

The remainder of this paper is structured as follows: Section 2 reviews the kernel Fisher discriminant classifier and introduces the notation used throughout. Section 3 then proposes an efficient algorithm for selecting the regularisation parameter of a KFD classifier, so as to minimise the leave-one-out cross-validation error, with a computational complexity of only $O(\ell^2)$ operations instead of the $O(\ell^3)$ operations of direct methods (Cawley & Talbot, 2003b). Section 4 presents results obtained on a range of real-world benchmark datasets. The extension of this approach to closely related forms of least-squares kernel learning is discussed in Section 5. Finally, the work is summarised in Section 6.

Section snippets

The kernel Fisher discriminant classifier

Assume we are given training data $X = \{x_1, x_2, \ldots, x_\ell\} = X_1 \cup X_2 \subset \mathbb{R}^d$, where $X_1 = \{x_1^1, x_2^1, \ldots, x_{\ell_1}^1\}$ is a set of $\ell_1$ patterns belonging to class C1 and similarly $X_2 = \{x_1^2, x_2^2, \ldots, x_{\ell_2}^2\}$ is a set of $\ell_2$ patterns belonging to class C2; Fisher’s linear discriminant (FLD) attempts to find a linear combination of the input variables, $w \cdot x$, that maximises the average separation of the projections of points belonging to C1 and C2, whilst minimising the within-class variance of the projections of those points. The Fisher
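The snippet breaks off before the criterion itself is stated. For reference, the quantity maximised by Fisher’s linear discriminant can be written in its standard textbook form; the notation for the class means and within-class scatter below is the usual one and is assumed rather than quoted from the paper:

```latex
% Standard Fisher criterion (textbook form; notation assumed, not quoted
% from the truncated snippet above).
J(w) \;=\; \frac{\bigl(w^{\top} m_1 - w^{\top} m_2\bigr)^{2}}{w^{\top} S_W\, w},
\qquad
m_k \;=\; \frac{1}{\ell_k}\sum_{x \in X_k} x,
\qquad
S_W \;=\; \sum_{k=1}^{2}\;\sum_{x \in X_k} (x - m_k)(x - m_k)^{\top}.
```

The kernel Fisher discriminant applies the same criterion after mapping the data into a feature space induced by a kernel function, typically with a regularisation term added to the denominator to keep the problem well posed.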

Method

In this section, we describe a training algorithm for the kernel Fisher discriminant classifier in which the system of linear equations (4) is solved in canonical form. This allows the model parameters to be updated following a change in the value of the regularisation parameter with a computational complexity of only $O(\ell)$ operations. This also permits the extension of an existing analytic method (Cawley & Talbot, 2003b) for re-evaluation of the leave-one-out cross-validation error in only $O(\ell^2)$
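As a hedged illustration of the canonical-form idea, consider the simplified regularised least-squares system $(K + \mu I)\alpha = t$, omitting the bias term that appears in the full KFD system (4). A single eigendecomposition of the kernel matrix isolates the regularisation parameter:

```latex
% Illustration only: canonical-form update for the simplified system
% (K + mu I) alpha = t (bias term omitted; not the paper's exact system (4)).
K = V \Lambda V^{\top}
\;\;\Longrightarrow\;\;
\alpha(\mu) = V(\Lambda + \mu I)^{-1} V^{\top} t
            = V\,\operatorname{diag}\!\Bigl(\tfrac{1}{\lambda_i + \mu}\Bigr)\, c,
\qquad c = V^{\top} t .
```

With $c$ computed once, the canonical coefficients $c_i/(\lambda_i + \mu)$ are refreshed in $O(\ell)$ per candidate $\mu$, and the leave-one-out residuals follow from the standard identity for this family of linear smoothers, $e_i^{(-i)} = (t_i - \hat{t}_i)/(1 - h_{ii})$ with $H(\mu) = V \operatorname{diag}\bigl(\lambda_i/(\lambda_i + \mu)\bigr) V^{\top}$, whose diagonal costs $O(\ell^2)$ to recompute for each value of $\mu$.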

Results

The runtime for model selection based on the proposed optimally regularised kernel Fisher discriminant classifier is evaluated over a series of randomly generated synthetic datasets. In each case, approximately one quarter of the data belong to class C1 and three-quarters to class C2. The patterns comprising class C1 are drawn from a bivariate Normal distribution with zero mean and unit variance. The patterns forming class C2 form an annulus; the radii of the data are drawn from a normal
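The snippet is cut short, but a generator in the same spirit is easy to sketch in Python; the annulus radius and spread used below are assumed values chosen for illustration, not the settings reported in the paper:

```python
import numpy as np

def make_synthetic(n=256, r_mean=4.0, r_std=0.5, seed=0):
    """Synthetic benchmark in the spirit of the description above (assumed
    parameters): roughly one quarter of the patterns (class C1) come from a
    bivariate standard normal; the remainder (class C2) form an annulus whose
    radii are normally distributed and whose angles are uniform."""
    rng = np.random.default_rng(seed)
    n1 = n // 4                                   # class C1: ~ one quarter of the data
    n2 = n - n1                                   # class C2: the remaining three quarters
    X1 = rng.standard_normal((n1, 2))             # N(0, I) in two dimensions
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n2)    # uniform angles
    r = rng.normal(r_mean, r_std, size=n2)            # normally distributed radii
    X2 = np.column_stack((r * np.cos(theta), r * np.sin(theta)))
    X = np.vstack((X1, X2))
    y = np.concatenate((np.ones(n1), -np.ones(n2)))   # +1 for C1, -1 for C2
    return X, y
```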

Optimal regularisation for related formulations

The use of the eigen-decomposition, or equivalently the singular value decomposition, to isolate the effect of the regularisation term can also be used to develop optimally regularised variants of kernel ridge regression (Saunders et al., 1998) (also known as the regularization network (Poggio & Girosi, 1990) and regularised least squares (Rifkin, 2002)) and the least-squares support vector machine (Suykens & Vandewalle, 1999). Kernel ridge regression (Saunders et al., 1998) constructs a kernel
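For the kernel ridge regression case named above (no bias term), the same eigendecomposition trick yields a simple sweep for the LOO-optimal regularisation parameter. The sketch below is an illustration under those assumptions, not the paper's ORKFD procedure; the RBF kernel and the grid of candidate values are hypothetical choices:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian RBF kernel matrix; gamma is an assumed kernel parameter."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def loo_optimal_mu(K, y, mu_grid):
    """LOO-optimal regularisation for kernel ridge regression without a bias
    term: one O(n^3) eigendecomposition of K, then O(n^2) work per candidate
    value of the regularisation parameter mu."""
    lam, V = np.linalg.eigh(K)        # K = V diag(lam) V^T, computed once
    c = V.T @ y                       # targets in canonical coordinates
    best_mu, best_press, best_alpha = None, np.inf, None
    for mu in mu_grid:
        s = lam / (lam + mu)                          # eigenvalues of the hat matrix
        y_hat = V @ (s * c)                           # fitted values, O(n^2)
        h_diag = (V ** 2) @ s                         # diagonal of the hat matrix, O(n^2)
        press = np.mean(((y - y_hat) / (1.0 - h_diag)) ** 2)   # LOO residuals
        if press < best_press:
            best_mu, best_press = mu, press
            best_alpha = V @ (c / (lam + mu))         # dual coefficients at this mu
    return best_mu, best_alpha

# Typical use: K = rbf_kernel(X, X); mu, alpha = loo_optimal_mu(K, y, np.logspace(-6, 3, 50))
```

The expensive $O(\ell^3)$ eigendecomposition is paid only once; each candidate value of the regularisation parameter then costs $O(\ell^2)$, which is what makes a fine grid (or a one-dimensional minimiser) essentially free relative to training.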

Summary

Model selection, the optimal choice of the values for a small number of regularisation and kernel parameters, is the key step in maximising the generalisation performance of kernel learning methods. Conventional k-fold and leave-one-out cross-validation strategies provide computationally expensive, but highly effective solutions. In this paper, we extend an existing analytic method for efficient evaluation of the leave-one-out cross-validation error of a kernel Fisher discriminant classifier (Cawley & Talbot, 2003b)

Acknowledgements

We thank the anonymous reviewers for their helpful and constructive comments. This work was supported by the Biotechnology and Biological Sciences Research Council (BBSRC) of the United Kingdom (grant number 83/EGM16128) and the Royal Society (grant number RSRG-22270).

References (37)

  • G.C. Cawley et al. Improved sparse least-squares support vector machines. Neurocomputing (2002).
  • G.C. Cawley et al. Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition (2003).
  • Bay, S. D. (1999). The UCI KDD archive (http://kdd.ics.uci.edu/). Irvine, CA: University of California, Department of...
  • B.E. Boser et al. A training algorithm for optimal margin classifiers.
  • S. Boyd et al. Convex optimization (2004).
  • G.C. Cawley et al. A greedy training algorithm for sparse least-squares support vector machines.
  • Cawley, G. C., & Talbot, N. L. C. (2003a). Efficient cross-validation of kernel Fisher discriminant classifiers. In...
  • G.C. Cawley et al. Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters. Journal of Machine Learning Research (2007).
  • O. Chapelle et al. Choosing multiple parameters for support vector machines. Machine Learning (2002).
  • C. Cortes et al. Support vector networks. Machine Learning (1995).
  • N. Cristianini et al. An introduction to support vector machines (and other kernel-based learning methods) (2000).
  • S. Fine et al. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research (2001).
  • R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics (1936).
  • Y. Freund et al. Experiments with a new boosting algorithm.
  • G.H. Golub et al. Matrix computations (1996).
  • Grove, A. J., & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned features. In Proceedings...
  • W.W. Hager. Updating the inverse of a matrix. SIAM Review (1989).
  • T. Hastie et al. The elements of statistical learning: Data mining, inference and prediction.