Neural Networks

Volume 14, Issue 10, December 2001, Pages 1447-1461

Contributed article
Oriented principal component analysis for large margin classifiers

https://doi.org/10.1016/S0893-6080(01)00106-X

Abstract

Large margin classifiers (such as MLPs) are designed to assign training samples with high confidence (or margin) to one of the classes. Recent theoretical results for these systems show why the use of regularisation terms and feature extraction techniques can enhance their generalisation properties. Since the optimal subset of features depends not only on the classification problem but also on the particular classifier with which the features are used, global learning algorithms for large margin classifiers that use feature extraction techniques are desirable. A direct approach is to optimise a cost function based on the margin error, which also incorporates regularisation terms for controlling capacity. These terms must penalise a classifier with the largest margin for the problem at hand. Our work shows that the inclusion of a PCA term can be employed for this purpose. Since PCA only achieves an optimal discriminatory projection for some particular distributions of data, the margin of the classifier can then be effectively controlled. We also propose a simple constrained search for the global algorithm in which the feature extractor and the classifier are trained separately. This allows a degree of flexibility for including heuristics that can enhance the search and the performance of the computed solution. Experimental results demonstrate the potential of the proposed method.

Introduction

In classification, observations belong to different classes. Based on some prior knowledge about the problem and on the training set, a classifier is constructed for assigning future observations to one of the existing classes. Typically, these systems are designed to minimise the number of misclassifications in the training set. However, it has been shown recently that, in order to ensure a small generalisation error, we should also take into account the confidence or margin of the classifications. Consequently, classifiers should also be trained to have a large margin distribution over the training samples, i.e. the training samples must be assigned to the correct class with high confidence on average. A large margin distribution helps to stabilise the solution, so that capacity can be controlled.
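
As a point of reference (the notation here is ours, not quoted from the paper), the margin of a labelled training sample for a classifier with one real-valued output per class is usually defined as follows; a positive value indicates a correct decision, and its magnitude measures the confidence of that decision:

```latex
% Margin of a labelled sample (x, y) for a classifier with per-class
% scores f_1(x), ..., f_c(x); the predicted class is the argmax over j.
\[
  \operatorname{margin}(\mathbf{x}, y) \;=\;
  f_y(\mathbf{x}) \;-\; \max_{j \neq y} f_j(\mathbf{x})
\]
```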

Two recent examples of large margin classifiers are support vector learning machines (SVMs) (Vapnik, 1998) and boosting classifiers (Schapire, 1999, Schapire et al., 1998). However, other learning machines such as multilayer perceptrons (MLPs) also belong to this category because the minimisation of the mean squared error (MSE) leads in practice to maximising the margin, as Bartlett (1998) suggests. The capacity of this latter class of classifiers increases with the dimensionality of the input space. Accordingly, the use of a feature extractor, which is common in practice, is strongly justified from a theoretical point of view.

Feature extraction and selection techniques form linear (or non-linear) combinations of the original variables and then select the most relevant combinations. The complexity of the original problem is therefore reduced by projecting the input patterns onto a lower-dimensional space. Clearly, the optimal subset of features and the transformation used to compute them depend on the classification problem and on the particular classifier with which they are used (Bishop, 1995, p. 304). This dependency argues for the use of a global training process (e.g. Bottou and Gallinari, 1991, LeCun et al., 1999).
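
For the linear case considered in this paper, the feature extractor can be summarised as a projection matrix applied to the input (again, the notation is ours):

```latex
% Linear feature extraction: the p-dimensional input x is mapped to an
% m-dimensional feature vector z (m < p); the classifier then operates on z.
\[
  \mathbf{z} \;=\; W^{\mathsf{T}} \mathbf{x},
  \qquad W \in \mathbb{R}^{p \times m}, \quad m < p .
\]
```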

However, the traditional design process of a pattern recogniser is often based on training each module individually. Furthermore, the feature extractor is sometimes completely handcrafted, since it is rather specific to the problem. The main drawback of this ad hoc approach is that classification accuracy largely depends on how well the manual feature selection is performed. The advent of cheap computers, large databases, and new powerful learning machines has changed this way of thinking over the last decade. Automatic feature extraction is now employed widely in many difficult real-world problems such as handwritten digit recognition (e.g. Choe et al., 1996, Schölkopf et al., 1998, Sirosh, 1995).

The use of unsupervised techniques for automatic feature extraction is still very common, but since they take no account of class information it is difficult to predict how well they will work as pre-processing for a given classification problem. In some cases, representing the input space through an economical description (e.g. through compression) can entail a critical loss of the information needed to classify. For instance, principal component analysis (PCA) (Jolliffe, 1986) for feature extraction projects the input space onto the directions in which the data vary most (high-variance PCs), so that the original point can be reconstructed from the transformed space with minimum mean squared error. However, there is no reason to believe that the separation between classes will lie in the direction of the high-variance PCs for an arbitrary classification problem. The first few PCs will only be useful in those cases in which the intra- and inter-class variations have the same dominant directions, or in which the inter-class variations are clearly larger than the intra-class variations. Otherwise, PCA will lead to a partial (or even complete) loss of discriminatory information. Attempts to overcome these limitations using class information have been proposed. The most popular group of PCA-based techniques for pattern recognition is that of subspace classifiers (Fukunaga and Koontz, 1970, Oja, 1983, Wold, 1976), which compute PCs separately for each class and describe each class by a low-dimensional principal component subspace. The main drawback of these methods is that they do not retain information about the relative differences between classes, so populations with similar PCs lead to very poor classification accuracy. Other efforts incorporate supervised terms into the learning process, but they mainly rely on heuristics. Similar strategies have also appeared in other unsupervised systems such as vector quantisation (Gray and Olshen, 1997, Oehler and Gray, 1995, Perlmutter et al., 1996).
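
A minimal numerical sketch (ours, not taken from the paper) of the failure mode described above: when the intra-class variation dominates along a direction orthogonal to the class separation, projecting onto the first PC discards essentially all of the discriminatory information.

```python
# Illustrative sketch (not from the paper): two Gaussian classes whose
# separation lies along a low-variance direction.  Keeping only the
# first (high-variance) principal component discards essentially all
# of the discriminatory information.
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Large shared variance along x1, small class separation along x2.
class0 = np.column_stack([rng.normal(0.0, 10.0, n), rng.normal(-1.0, 0.5, n)])
class1 = np.column_stack([rng.normal(0.0, 10.0, n), rng.normal(+1.0, 0.5, n)])
X = np.vstack([class0, class1])
y = np.concatenate([np.zeros(n), np.ones(n)])

# PCA: eigenvectors of the covariance matrix, sorted by variance.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
w1 = eigvecs[:, np.argmax(eigvals)]        # first (high-variance) PC

z = Xc @ w1                                # 1-D projection onto PC1
# A simple threshold on z separates the classes barely better than chance,
# because PC1 is aligned with x1 while the classes differ only along x2.
acc = max(np.mean((z > 0) == y), np.mean((z > 0) != y))
print(f"accuracy after projecting onto the first PC: {acc:.2f}")
```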

In this paper, we derive a global training algorithm in which a large margin classifier employs a linear transformation for feature extraction. Since feature extraction in the context of large margin classifiers must also enhance the margin, an overall cost function, which involves the whole set of parameters of the system, is defined to maximise it. However, regularisation terms based on PCA are introduced to better control the capacity of the pattern recogniser, since the limits of PCA in achieving an optimal projection for classification purposes prevent the classifier from attaining the largest margin, a requisite for controlling capacity effectively. The computed linear projections can thus be understood as oriented principal components, since margin information rotates the projections away from the PCA solution. The global learning algorithm presented here uses a two-step strategy in which the training of the linear feature extractor and the training of the classifier are two separate phases. This structure offers a degree of versatility that is not present in a learning algorithm that varies all the parameters at the same time, for example the possibility of including additional learning procedures to enhance the performance of the learning system.
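
The following schematic sketch shows what such a two-step training loop can look like. It is our own reconstruction under several assumptions: `train_classifier` and `margin_error` stand in for the adaptive soft k-NN classifier and its margin-based cost (the paper's Eq. (8)), the PCA term is written as a reconstruction error over a column-orthonormal projection, `lam` is a hypothetical trade-off weight, and plain finite-difference descent replaces the paper's actual update rules.

```python
# Schematic two-step training loop (our sketch, not the authors' exact
# algorithm): the projection W and the classifier are trained in
# alternating phases.  The cost for W combines a margin-based
# classification term with a PCA-style reconstruction term that acts
# as a regulariser; lam controls the trade-off between them.
import numpy as np

def pca_reconstruction_error(X, W):
    """Mean squared error of reconstructing X from its projection X @ W.

    Assumes X is mean-centred and W has orthonormal columns, so the
    reconstruction is (X @ W) @ W.T; over such W, this term is
    minimised by the ordinary PCA solution.
    """
    X_hat = X @ W @ W.T
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

def orthonormalise(W):
    """Re-project W onto the set of matrices with orthonormal columns."""
    Q, _ = np.linalg.qr(W)
    return Q[:, : W.shape[1]]

def train_opca(X, y, m, train_classifier, margin_error,
               lam=0.1, lr=1e-3, outer_iters=20, eps=1e-4):
    """Alternating (two-step) optimisation sketch.

    train_classifier(Z, y) -> classifier and margin_error(clf, Z, y) -> float
    are supplied by the user; they stand in for the adaptive soft k-NN
    classifier and its margin-based cost used in the paper.
    """
    p = X.shape[1]
    W = orthonormalise(np.random.default_rng(0).normal(size=(p, m)))
    for _ in range(outer_iters):
        # Step 1: with W fixed, train the large margin classifier on Z = X W.
        clf = train_classifier(X @ W, y)

        # Step 2: with the classifier fixed, update W by numerical descent on
        # the combined cost (finite differences keep the sketch self-contained).
        def cost(Wmat):
            return (margin_error(clf, X @ Wmat, y)
                    + lam * pca_reconstruction_error(X, Wmat))
        grad = np.zeros_like(W)
        for i in range(p):
            for j in range(m):
                dW = np.zeros_like(W)
                dW[i, j] = eps
                grad[i, j] = (cost(W + dW) - cost(W - dW)) / (2 * eps)
        W = orthonormalise(W - lr * grad)
    return W, clf
```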

In the next section, we introduce the basics of our approach to oriented PCA (OPCA) for classification. Section 3 presents the large margin classifier used in the experimental section, the adaptive soft k-NN classifier (Bermejo and Cabestany, 1999, Bermejo and Cabestany, 2001a). In Section 4, we compare our linear feature extraction method with PCA, ICA and LDA on one artificial problem and six real data sets. Finally, some conclusions are given in Section 5.

Section snippets

OPCA for classification with large margin

In pattern recognition, observations, which belong to the input space X, are assigned to one of the c existing classes according to a mapping (or classifier) g(x;w): ℝ^p → {1,…,c}, where the vector w denotes the classifier's tuneable parameters. The classifier makes an error when g(x;w) ≠ y, where y indicates the class label of the pattern x. If all the classes have the same risk, then the performance of g can be measured with the probability of classification error, defined as L(g) = P{g(x;w) ≠ y}, where P
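
For reference (this expression is standard and not quoted from the paper), the empirical counterpart of this error over a training set of n labelled samples is

```latex
% Empirical classification error: the fraction of training samples that
% the classifier g(.; w) assigns to the wrong class.
\[
  \hat{L}_n(g) \;=\; \frac{1}{n} \sum_{i=1}^{n}
      \mathbf{1}\{\, g(\mathbf{x}_i;\mathbf{w}) \neq y_i \,\}
\]
```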

The adaptive soft k-nearest-neighbour classifier

As we pointed out in the previous section, we need a classifier in order to apply the global training of Section 2.3.1. This section briefly introduces the adaptive soft k-NN classifier, which is the large margin classifier employed in the experimental section. For further details of this classifier, see Bermejo and Cabestany, 1999, Bermejo and Cabestany, 2001a.

The so-called adaptive soft k-NN classifier is simply a soft k-NN rule plus a learning algorithm based on minimising Eq. (8). The soft k-NN
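
As a generic illustration of what a softened k-NN decision can look like (this is our own formulation, not necessarily the rule or Eq. (8) defined in the paper), the k nearest prototypes can vote with distance-decaying softmax weights instead of a hard majority:

```python
# Generic soft k-NN decision sketch (our formulation): the k nearest
# prototypes vote for their classes with weights that decay with
# distance, giving a differentiable, confidence-valued output instead
# of a hard vote.
import numpy as np

def soft_knn_scores(x, prototypes, labels, n_classes, k=2, beta=1.0):
    """Return per-class scores in [0, 1] for a single input vector x.

    prototypes : (M, p) array of reference vectors
    labels     : (M,) array of their class indices in {0, ..., n_classes-1}
    beta       : softness parameter (larger -> closer to the hard k-NN vote)
    """
    d = np.linalg.norm(prototypes - x, axis=1)     # distances to all prototypes
    nn = np.argsort(d)[:k]                         # indices of the k nearest
    w = np.exp(-beta * d[nn])                      # distance-decaying weights
    w /= w.sum()                                   # softmax-style normalisation
    scores = np.zeros(n_classes)
    for idx, weight in zip(nn, w):
        scores[labels[idx]] += weight              # soft, weighted class vote
    return scores                                  # argmax gives the decision
```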

Experimental results

In this section we introduce the experiments in which we compare the global training algorithm based on OPCA + adaptive soft 2-NN classifier with other pattern recognisers on six real data sets and one artificial problem (Table 1). First, we review the linear projections used for comparison with OPCA, as well as an alternative classifier to the adaptive soft 2-NN. Then, we include some practical considerations on the application of our algorithm. Finally, we present the results of our experiments.
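
The sketch below shows the structure of such a comparison using off-the-shelf components; scikit-learn, the wine data set and a plain k-NN classifier are our stand-ins for the paper's data sets and the adaptive soft 2-NN classifier, and only the experimental scaffolding is illustrated.

```python
# Baseline comparison sketch (ours): PCA, ICA and LDA used as linear
# pre-processing for a plain k-NN classifier, evaluated by cross-validation.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA, FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
m = 2  # number of extracted features (LDA allows at most n_classes - 1)

projections = {
    "PCA": PCA(n_components=m),
    "ICA": FastICA(n_components=m, random_state=0),
    "LDA": LinearDiscriminantAnalysis(n_components=m),
}

for name, proj in projections.items():
    pipe = make_pipeline(StandardScaler(), proj, KNeighborsClassifier(n_neighbors=2))
    acc = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: mean 5-fold accuracy = {acc:.3f}")
```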

Conclusions

Large margin classifiers such as MLPs or adaptive soft k-NN classifiers are designed, in the learning phase, to assign input patterns with high confidence to one of the classes. Practical ways of ensuring good generalisation properties for these systems, which are supported by theoretical results, involve, among others, the use of regularisation terms in the cost functions of the learning algorithm and the use of feature extraction techniques.

A direct approach for designing a linear feature extractor + large

References (44)

  • S. Bermejo et al. Adaptive soft k-nearest neighbour classifiers. Pattern Recognition (1999)
  • S. Wold. Pattern recognition by means of disjoint principal component models. Pattern Recognition (1976)
  • P.L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory (1998)
  • Bermejo, S., & Cabestany, J. (2001a). Adaptive soft k-nearest neighbour classifiers. To be published in Pattern...
  • S. Bermejo et al. Finite-sample convergence properties of the LVQ1 algorithm and the BLVQ1 algorithm. Neural Processing Letters (2001)
  • C.M. Bishop. Neural networks for pattern recognition (1995)
  • L. Bottou et al. Convergence properties of k-means
  • L. Bottou et al. A framework for the cooperation of learning algorithms
  • L. Bottou et al. Local learning algorithms. Neural Computation (1992)
  • Y. Choe et al. Laterally interconnected self-organizing maps in hand-written digit recognition
  • Cortes, C. (1995). Prediction of generalization ability in learning machines. PhD Thesis, New York: University of...
  • G. Deco et al. An information-theoretic approach to neural computing (1995)
  • J.E. Dennis et al. A view of unconstrained optimization
  • L. Devroye et al. A probabilistic theory of pattern recognition (1996)
  • K.I. Diamantaras et al. Principal component neural networks, theory and applications (1996)
  • R.O. Duda et al. Pattern classification and scene analysis (1973)
  • K. Fukunaga et al. Applications of the Karhunen–Loève expansion to feature extraction and ordering. IEEE Transactions on Computers (1970)
  • J.C. Gower. Discussion of the paper 'What is projection pursuit?'. Journal of the Royal Statistical Society A (1987)
  • Gray, R.M., & Olshen, R.A. (1997). Vector quantization and density estimation. Technical Report. Stanford, CA: Stanford...
  • T. Hastie et al. Discussion of the paper 'What is projection pursuit?'. Journal of the Royal Statistical Society A (1987)
  • A. Hyvärinen. Survey on independent component analysis. Neural Computing Surveys (1999)
  • Hyvärinen, A., & Oja, E. (1999). Independent component analysis: a tutorial. In Proceedings of the International joint...