Contributed article
Oriented principal component analysis for large margin classifiers
Introduction
In classification, observations belong to different classes. Based on some prior knowledge about the problem and on the training set, a classifier is constructed for assigning future observations to one of the existing classes. Typically, these systems are designed to minimise the number of misclassifications in the training set. However, it has recently been shown that, in order to ensure a small generalisation error, we should also take into account the confidence, or margin, of the classifications. Consequently, classifiers should also be trained to produce a large margin distribution over the training samples, i.e. the training samples must, on average, be assigned to the correct class with high confidence. A large margin distribution helps to stabilise the solution, so that capacity can be controlled.
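As an illustration of the margin notion used above (our sketch, not a formula from the paper): for a multi-class discriminant that outputs one score per class, the margin of a sample can be taken as the score of the correct class minus the best competing score, so a large margin distribution means this quantity is large and positive on average.

```python
import numpy as np

def margins(scores, labels):
    """Multi-class margins: correct-class score minus best competing score.

    scores: (n, c) array of class scores f_j(x_i); labels: (n,) true classes.
    A positive margin means a correct classification; larger means more confident.
    """
    n = scores.shape[0]
    correct = scores[np.arange(n), labels]
    rivals = scores.copy()
    rivals[np.arange(n), labels] = -np.inf   # mask the true class
    return correct - rivals.max(axis=1)

scores = np.array([[2.0, 0.5, -1.0],   # confidently class 0
                   [0.2, 0.3,  0.1],   # barely class 1
                   [1.0, 2.0,  0.0]])  # misclassified if true class is 0
labels = np.array([0, 1, 0])
print(margins(scores, labels))  # large positive, small positive, negative
```

A large-margin learning criterion, such as those cited in the next paragraph, pushes this empirical distribution away from zero.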
Two recent examples of large margin classifiers are support vector machines (SVMs) (Vapnik, 1998) and boosting classifiers (Schapire, 1999, Schapire et al., 1998). However, other learning machines such as multilayer perceptrons (MLPs) also belong to this category, because minimising the mean squared error (MSE) leads in practice to maximising the margin, as Bartlett (1998) suggests. The capacity of this latter class of classifiers increases with the dimension of the input space. Accordingly, the use of a feature extractor, which is common in practice, is strongly justified from a theoretical point of view.
Feature extraction and selection techniques form linear (or non-linear) combinations of the original variables and then select the most relevant combination. Therefore, the complexity of the original problem is reduced by projecting the input patterns to a lower-dimensional space. Clearly, the optimal subset of features selected and the transformation to compute these features depend on the classification problem and the particular classifier with which they are used (Bishop, 1995, p. 304). This dependency advocates the use of a global training process (e.g. Bottou and Gallinari, 1991, LeCun et al., 1999).
However, the traditional design process of a pattern recogniser is often based on training each module individually. Furthermore, the feature extractor is sometimes completely handcrafted, since it is rather specific to the problem. The main obstacle with this ad hoc approach is that classification accuracy depends largely on how well the manual feature selection is performed. The advent of cheap computers, large databases, and new powerful learning machines has changed this way of thinking over the last decade. Automatic feature extraction has now been employed widely in many difficult real-world problems such as handwritten digit recognition (e.g. Choe et al., 1996, Schölkopf et al., 1998, Sirosh, 1995).
The use of unsupervised techniques for automatic feature extraction is still very common at present, but since they take no account of class information, it is difficult to predict how well they will work as pre-processing for a given classification problem. In some cases, representing the input space through an economical description (e.g. through compression) can entail a critical loss of discriminatory information. For instance, principal component analysis (PCA) (Jolliffe, 1986) for feature extraction projects the input space onto the directions in which the data vary most (the high-variance PCs), so that the original point can be reconstructed from the transformed space with minimum mean squared error. However, there is no reason to believe that the separation between classes will lie along the high-variance PCs for an arbitrary classification problem. The first few PCs will only be useful in those cases in which the intra- and inter-class variations have the same dominant directions, or the inter-class variations are clearly larger than the intra-class variations. Otherwise, PCA will lead to a partial (or even complete) loss of discriminatory information. Attempts to overcome these limitations using class information have been proposed. The most popular group of PCA-based techniques for pattern recognition is that of subspace classifiers (Fukunaga and Koontz, 1970, Oja, 1983, Wold, 1976), which compute PCs separately for each class and describe each class by a low-dimensional principal component subspace. The main drawback of these methods is that they do not retain information about the relative differences between classes, so populations with similar PCs lead to very poor classification accuracy. Other efforts include the inclusion of supervised terms in the learning process, but they mainly rely on heuristics.
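A toy numerical illustration of the failure mode described above (ours, not from the paper): when the intra-class spread along one axis dominates the inter-class separation along another, the leading PC of the pooled data aligns with the noisy axis and is nearly orthogonal to the class-separating direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes: large shared (intra-class) spread along x,
# small inter-class separation along y.
n = 500
class0 = rng.normal([0.0, -1.0], [5.0, 0.3], size=(n, 2))
class1 = rng.normal([0.0,  1.0], [5.0, 0.3], size=(n, 2))
X = np.vstack([class0, class1])

# Leading principal component of the pooled (unlabelled) data.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]

# PC1 follows the high-variance x-axis, nearly orthogonal to the
# class-separating y-axis, so projecting onto it discards the classes.
print(np.abs(pc1))
```

Projecting these data onto `pc1` would mix the two classes almost completely, even though they are linearly separable along the discarded direction.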
Similar strategies have also appeared in other unsupervised systems such as vector quantisation (Gray and Olshen, 1997, Oehler and Gray, 1995, Perlmutter et al., 1996).
In this paper, we derive a global training algorithm in which a large margin classifier employs a linear transformation for feature extraction. Since feature extraction in the context of large margin classifiers must also enhance the margin, an overall cost function, which involves the whole set of parameters of the system, is defined to maximise it. However, regularisation terms based on PCA are introduced to better control the capacity of the pattern recogniser, since the limitations of PCA as an optimal projection for classification prevent the classifier from attaining the largest margin, a prerequisite for controlling capacity effectively. The computed linear projections can thus be understood as oriented principal components, since margin information rotates the projections away from the PCA solution. The global learning algorithm presented here uses a two-step strategy in which the training of the linear feature extractor and of the classifier are two separate phases. This offers a degree of versatility that is not available in a learning algorithm that varies all the parameters at the same time: for instance, additional learning procedures can be included to enhance the performance of the learning system.
In the next section, we will introduce the basics of our approach to the oriented PCA (OPCA) for classification. Section 3 presents a large margin classifier that we use in the experimental section called the adaptive soft k-NN classifier (Bermejo and Cabestany, 1999, Bermejo and Cabestany, 2001a). In Section 4, we compare our linear feature method with PCA, ICA and LDA on one artificial problem and six real data sets. Finally, some conclusions are given in Section 5.
Section snippets
OPCA for classification with large margin
In pattern recognition, observations, which belong to the input space X, are assigned to one of the c existing classes according to a mapping (or classifier) g(·; w): X → {1, …, c}, where the vector w denotes the classifier's tuneable parameters. The classifier makes an error when g(x; w) ≠ y, where y indicates the class label of the pattern x. If all the classes have the same risk, then the performance of g can be measured with the probability of classification error, defined as P_e = P(g(x; w) ≠ y), where P denotes the underlying probability distribution of the data.
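In practice P_e is unknown and is estimated from a sample; a minimal sketch (the threshold classifier below is a hypothetical example, not from the paper) is the empirical misclassification rate:

```python
import numpy as np

def empirical_error(g, X, y):
    """Monte Carlo estimate of P_e = P(g(x; w) != y) from a labelled sample."""
    return np.mean(g(X) != y)

# Hypothetical 1-D threshold classifier: predict class 1 iff x > 0.
g = lambda X: (X > 0).astype(int)
X = np.array([-2.0, -0.5, 0.5, 2.0, -1.0, 1.5])
y = np.array([0, 1, 1, 1, 0, 1])  # the sample (x=-0.5, y=1) is misclassified
print(empirical_error(g, X, y))  # 1/6, one error out of six samples
```

Minimising this empirical estimate alone is what the Introduction argues is insufficient; the margin of each decision must be taken into account as well.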
The adaptive soft k-nearest-neighbour classifier
As we pointed out in the previous section, we need a classifier in order to apply the global training of Section 2.3.1. This section briefly introduces the adaptive soft k-NN classifier, which is the large margin classifier employed in the experimental section. For further details of this classifier, see Bermejo and Cabestany, 1999, Bermejo and Cabestany, 2001a.
The so-called adaptive soft k-NN classifier is simply a soft k-NN rule plus a learning algorithm based on minimising Eq. (8). The soft k-NN
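The exact rule of Bermejo and Cabestany is not reproduced in this snippet, but a generic "soft" k-NN rule can be sketched as follows; the softmax weighting on negative distance and the parameter beta are illustrative assumptions of ours, not the authors' formulation.

```python
import numpy as np

def soft_knn_predict(X_train, y_train, x, k=3, beta=1.0, n_classes=2):
    """Illustrative soft k-NN: the k nearest neighbours cast normalised,
    distance-weighted votes instead of the hard majority vote of plain k-NN."""
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]            # indices of the k nearest neighbours
    w = np.exp(-beta * d[idx])
    w /= w.sum()                       # soft, normalised votes
    votes = np.zeros(n_classes)
    for i, wi in zip(idx, w):
        votes[y_train[i]] += wi
    return votes.argmax(), votes       # class with the largest soft vote

X_train = np.array([[0.0], [0.2], [1.0], [1.2]])
y_train = np.array([0, 0, 1, 1])
pred, votes = soft_knn_predict(X_train, y_train, np.array([0.1]), k=3)
print(pred)  # 0: the two nearby class-0 points outweigh the distant class-1 one
```

Because the votes are continuous, such a rule yields a confidence per class, which is what makes a margin-based training criterion such as Eq. (8) applicable.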
Experimental results
In this section we introduce the experiments in which we compare the global training algorithm based on OPCA+adaptive soft 2-NN classifier with other pattern recognisers on six real data sets and one artificial problem (Table 1). First, we review the linear projections for comparison with OPCA and an alternative classifier to the adaptive soft 2-NN. Then, we include some practical considerations on the application of our algorithm. Finally, we present the results of our experiments.
Conclusions
Large margin classifiers such as MLPs or adaptive soft k-NN classifiers are designed in the learning phase to assign input patterns with high confidence to one of the classes. Practical ways of ensuring good generalisation properties for these systems, which are supported by theoretical results, involve among others the use of regularisation terms in the cost functions of the learning algorithm and feature extractor techniques.
A direct approach for designing a linear feature extractor+large
References (44)
- Bermejo, S., & Cabestany, J. (1999). Adaptive soft k-nearest neighbour classifiers. Pattern Recognition.
- Wold (1976). Pattern recognition by means of disjoint principal component models. Pattern Recognition.
- Bartlett (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory.
- Bermejo, S., & Cabestany, J. (2001a). Adaptive soft k-nearest neighbour classifiers. To be published in Pattern...
- et al. (2001). Finite-sample convergence properties of the LVQ1 algorithm and the BLVQ1 algorithm. Neural Processing Letters.
- Bishop (1995). Neural networks for pattern recognition.
- et al. Convergence properties of k-means.
- Bottou & Gallinari (1991). A framework for the cooperation of learning algorithms.
- et al. (1992). Local learning algorithms. Neural Computation.
- Choe et al. (1996). Laterally interconnected self-organizing maps in handwritten digit recognition.
- An information-theoretic approach to neural computing.
- A view of unconstrained optimization.
- A probabilistic theory of pattern recognition.
- Principal component neural networks: theory and applications.
- Pattern classification and scene analysis.
- Fukunaga & Koontz (1970). Applications of the Karhunen–Loève expansion to feature extraction and ordering. IEEE Transactions on Computers.
- Discussion of the paper "What is projection pursuit?". Journal of the Royal Statistical Society A.
- Survey on independent component analysis. Neural Computing Surveys.