Pattern Recognition

Volume 38, Issue 10, October 2005, Pages 1746-1758

SVM decision boundary based discriminative subspace induction

https://doi.org/10.1016/j.patcog.2005.01.016

Abstract

We study the problem of linear dimension reduction for classification, with a focus on sufficient dimension reduction, i.e., finding subspaces without loss of discrimination power. First, we formulate the concept of a sufficient subspace for classification in terms parallel to those used for regression. Then we present a new method to estimate the smallest sufficient subspace, based on an improvement of decision boundary analysis (DBA). The main idea is to combine DBA with support vector machines (SVM) to overcome the inherent difficulty of DBA in small sample size situations while keeping DBA's estimation simplicity. The compact representation of the SVM boundary yields a significant gain in both speed and accuracy over previous DBA implementations. Alternatively, the technique can be viewed as a way to reduce the run-time complexity of SVM itself. Comparative experiments on one simulated and four real-world benchmark datasets highlight the superior performance of the proposed approach.

Introduction

Dimension reduction is widely accepted as an analysis and modeling tool for dealing with high-dimensional spaces. There are several reasons to keep the dimension as low as possible: for instance, to reduce system complexity, to avoid the curse of dimensionality, and to enhance data understanding. In general, dimension reduction can be defined as the search for a low-dimensional linear or nonlinear subspace that preserves some intrinsic properties of the original high-dimensional data. However, different applications have different preferences as to which properties should be preserved in the reduction process. We can identify at least three cases:

  1. Visualization and exploration, where the challenge is to embed a set of high-dimensional observations into a low-dimensional Euclidean space that preserves as closely as possible their intrinsic global/local metric structure [1], [2], [3].

  2. Regression, in which the goal is to reduce the dimension of the predictor vector with the minimum loss in its capacity to infer about the conditional distribution of the response variable [4], [5], [6].

  3. Classification, where we seek reductions that minimize the lowest attainable classification error in the transformed space [7].

These disparate interpretations strongly influence the design and choice of an appropriate dimension reduction algorithm for a given task, as far as optimality is concerned.

In this paper we study the problem of dimensionality reduction for classification, which is commonly referred to as feature extraction in the pattern recognition literature [8], [9]. In particular, we restrict ourselves to linear dimension reduction, i.e., seeking a linear mapping that minimizes the lowest attainable classification error, i.e., the Bayes error, in the reduced subspace. Linear mappings are mathematically tractable and computationally simple, and have a certain regularization ability that sometimes makes them outperform nonlinear models. In addition, they may be extended nonlinearly, for example, through global coordination of local linear models (e.g., Refs. [10], [11]) or kernel mapping (e.g., Refs. [12], [13]).

PCA, ICA and LDA are typical linear dimension reduction techniques used in the pattern recognition community; each simultaneously generates a set of nested subspaces of all possible dimensions. However, they are not directly related to classification accuracy, since their optimality criteria are based on variance, independence and likelihood, respectively. Various other dimension reduction methods have also been proposed that better reflect the classification goal by iteratively optimizing some criterion that either approximates or bounds the Bayes error in the reduced subspace [7], [14], [15], [16], [17], [18]. Such methods invariably assume a given output dimension, and usually suffer from the problem of local minima. Even if one can find the optimal solution for a given dimension, several questions remain. How much discriminative information is lost in the reduction process? Which dimension should we choose next to get a better reduction? What is the smallest possible subspace that loses nothing from the original space as far as classification accuracy is concerned? Is there any efficient way to estimate this critical subspace other than the brute-force approach, i.e., enumerating every optimal subspace for every possible dimension? The motivation for the present work is to explore possible answers to these questions.

For recognition tasks, finding lower-dimensional feature subspaces without loss of discriminative information is especially attractive. We call this process sufficient dimension reduction, borrowing terminology from regression graphics [6]. Knowledge of the smallest sufficient subspace gives the classifier designer a deeper understanding of the problem at hand, and thus allows the classification to be carried out more effectively. However, among existing dimension reduction algorithms, few have formally incorporated the notion of sufficiency [19].

In the first part of this paper, we formulate the concept of a sufficient subspace for classification in terms parallel to those used for regression [6]. Our initial attempt is to explore a potential parallelism between classification and regression on the common problem of sufficient dimension reduction. In the second part, we discuss how to estimate the smallest sufficient subspace, or more formally, the intrinsic discriminative subspace (IDS). Decision boundary analysis (DBA), originally proposed by Lee and Landgrebe in 1993 [19], is a technique that promises, in theory, to recover the true IDS. Unfortunately, the conditions for their method to work appear to be quite restrictive [20]. The main weakness of DBA is its dependence on nonparametric functional estimation in the full-dimensional space, which is a hard problem due to the curse of dimensionality. Similar problems have been observed in average derivative estimation (ADE) [21], [22], a dimension reduction technique for regression analogous to DBA for classification.

However, recent discovery and elaboration of kernel methods for classification and regression seem to suggest that learning in very high dimensions is not necessarily a terrible mistake. Several successful algorithms (e.g., Refs. [23], [24], [25]) have been demonstrated with direct dependence on the intrinsic generalization ability of kernel machines in high dimensional spaces. In the same spirit, we will show in this paper that the marriage of DBA and kernel methods may lead to a superior reduction algorithm that shares the appealing properties of both. More precisely, we propose to combine DBA with support vector machines (SVM), a powerful kernel-based learning algorithm that has been successfully applied to many applications. The resultant SVM–DBA algorithm is able to overcome the difficulty of DBA in small sample size situations, and at the same time keep the simplicity of DBA with respect to IDS estimation. Thanks to the compact representation of SVM, our algorithm also achieves a significant gain in both estimation accuracy and computational efficiency over previous DBA implementations. From another perspective, the proposed method can be seen as a natural way to reduce the run-time complexity of SVM itself.
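To make the idea concrete, the sketch below illustrates one way a DBA-style analysis can be driven by an SVM boundary on a toy binary problem: train an RBF-kernel SVM, locate points on its decision boundary by bisecting between oppositely classified samples, evaluate the analytic gradient of the SVM decision function at those points to obtain boundary normals, and eigendecompose the scatter matrix of those normals. This is a hedged illustration of the general technique rather than the paper's exact SVM–DBA algorithm; the toy data, the boundary-point sampling scheme and all parameter values are our own assumptions.

# Hedged sketch of decision boundary analysis driven by an SVM boundary.
# Steps: (1) train an RBF-kernel SVM; (2) find points on its decision
# boundary by bisecting between oppositely classified training samples;
# (3) compute unit normals there from the analytic gradient of the SVM
# decision function; (4) eigendecompose the scatter matrix of the normals.
# All data, sampling choices and parameters below are illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy binary data: discriminative information lives in the first 2 of 10 dims.
n, d = 400, 10
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5, 1, -1)

gamma = 0.2
svc = SVC(kernel="rbf", C=10.0, gamma=gamma).fit(X, y)

def f(x):
    """SVM decision function value at a single point x."""
    return svc.decision_function(x.reshape(1, -1))[0]

def grad_f(x):
    """Analytic gradient of f(x) = sum_i a_i K(s_i, x) + b with RBF kernel:
    grad f(x) = 2 * gamma * sum_i a_i * K(s_i, x) * (s_i - x)."""
    S = svc.support_vectors_            # support vectors s_i
    a = svc.dual_coef_.ravel()          # a_i = alpha_i * y_i
    k = np.exp(-gamma * np.sum((S - x) ** 2, axis=1))
    return 2.0 * gamma * ((a * k) @ (S - x))

def boundary_point(x_a, x_b, iters=40):
    """Bisection along the segment [x_a, x_b] (assumed to straddle the
    boundary) to approximate a zero crossing of the decision function."""
    lo, hi, sign_lo = 0.0, 1.0, np.sign(f(x_a))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sign(f((1 - mid) * x_a + mid * x_b)) == sign_lo:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return (1 - t) * x_a + t * x_b

# Collect unit normals at boundary points between random +/- pairs.
X_neg, X_pos = X[y == -1], X[y == 1]
normals = []
for _ in range(5000):
    if len(normals) >= 200:
        break
    xa = X_neg[rng.integers(len(X_neg))]
    xb = X_pos[rng.integers(len(X_pos))]
    if f(xa) * f(xb) >= 0:              # pair does not straddle the boundary
        continue
    g = grad_f(boundary_point(xa, xb))
    norm = np.linalg.norm(g)
    if norm > 1e-12:
        normals.append(g / norm)
N = np.asarray(normals)

# Decision boundary scatter matrix and its eigen-decomposition.
M = N.T @ N / len(N)
eigvals, eigvecs = np.linalg.eigh(M)
order = np.argsort(eigvals)[::-1]
print("eigenvalue spectrum:", np.round(eigvals[order], 3))
# eigvecs[:, order[:m]] spans an m-dimensional estimate of the
# discriminative subspace; a sharp spectral drop suggests its dimension.

A sharp drop in the printed eigenvalue spectrum after the leading components is what signals a low-dimensional discriminative subspace; in this toy example the two leading eigenvectors should roughly align with the first two coordinate axes.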

Section snippets

Brief review of existing linear dimension reduction methods

There are two basic approaches to dimensionality reduction, supervised and unsupervised. In the context of classification, a supervised approach is generally believed to be more effective. However, there is strong evidence that this is not always true (e.g., PCA and ICA might outperform LDA in face identification [26], [27]). In this paper, we focus on supervised methods. According to the choice of criterion function, we further divide supervised methods into likelihood-based and error-based

Sufficient dimension reduction

This section serves two purposes: (1) to formulate the concept of sufficient subspace for classification in rigorous mathematical form, and (2) to reveal the potential parallelism between classification and regression on the common problem of sufficient dimension reduction. To these ends, we closely follow the recent work of Cook and Li (2002) [40].

Consider a Q-class classification problem with the underlying joint distribution P(x, y), where x ∈ ℝ^d is a d-dimensional random vector (feature), and y
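Since the snippet above is truncated, the following is only a hedged restatement of the sufficiency condition in the style of the regression formulation of Cook and Li [40] that the text parallels; the notation (Φ denoting a d × m matrix whose columns span the candidate subspace) is our assumption:

\[
  y \;\perp\!\!\!\perp\; x \;\big|\; \Phi^{\top} x
  \quad\Longleftrightarrow\quad
  P(y \mid x) = P(y \mid \Phi^{\top} x) \ \text{almost surely}.
\]

In words, projecting x onto the subspace spanned by Φ loses no information about the class label; the intrinsic discriminative subspace (IDS) discussed below is the smallest subspace satisfying this condition.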

Estimation of intrinsic discriminative subspace

Given an original feature space of dimension d, one brute-force procedure to estimate its IDS can be carried out as follows. First solve d independent reduction problems corresponding to all d possible subspace dimensions, resulting in a total of d subspaces {Φ_m}, m = 1, …, d, each of which is optimized for a particular subspace dimension m. Then choose one of them as the final estimate via, e.g., hypothesis testing, cross validation or other model selection techniques. The assumption behind this
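As a concrete, purely illustrative rendering of this brute-force procedure, the sketch below fits one reduction per candidate dimension, scores each by cross-validated accuracy, and keeps the smallest dimension whose score is within a tolerance of the best. PCA and a linear SVM are stand-in choices of ours for the per-dimension reducer and the downstream classifier, not the paper's.

# Hedged sketch of the brute-force IDS estimate described above: solve one
# reduction per candidate dimension m = 1..d, score each by cross-validated
# accuracy, and keep the smallest m whose score is within `tol` of the best.
# PCA and LinearSVC are illustrative placeholders for the reducer/classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def brute_force_ids_dim(X, y, tol=0.01, cv=5):
    d = X.shape[1]
    scores = []
    for m in range(1, d + 1):
        model = make_pipeline(PCA(n_components=m),
                              LinearSVC(C=1.0, max_iter=5000))
        scores.append(cross_val_score(model, X, y, cv=cv).mean())
    best = max(scores)
    for m, s in enumerate(scores, start=1):
        if s >= best - tol:        # smallest m that is "as good as" the best
            return m, scores
    return d, scores

# Usage (X, y as numpy arrays):
#   m_star, curve = brute_force_ids_dim(X, y)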

Datasets

We evaluate the proposed linear dimension reduction algorithm on one simulated and four real-world datasets drawn from the UCI Machine Learning Repository. Their basic information is summarized in Table 3.

WAVE-40 is a modified version of the simulated example from the CART book. It is a three-class problem with 40 attributes. The first 21 attributes of each class are generated from a combination of two of three "base" waves in Gaussian noise, x_i = u b_1(i) + (1 − u) b_2(i) + ε_i for Class 1, x_i = u b_1(i) + (1 − u) b_3(i) +
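For readers who wish to reproduce data of this kind, the sketch below follows the standard CART-style waveform construction consistent with the formulas above. The specific triangular base waves, the uniform mixing variable u, and the treatment of the remaining 19 attributes as pure Gaussian noise are assumptions on our part, since the snippet is truncated.

# Sketch of a CART-style waveform generator in the spirit of WAVE-40
# (assumptions: triangular base waves h1, h2, h3; u ~ Uniform(0,1);
# eps ~ N(0,1); the remaining 19 attributes are pure noise).
import numpy as np

def make_wave40(n_per_class=300, seed=0):
    rng = np.random.default_rng(seed)
    i = np.arange(1, 22)                       # 21 "signal" attributes
    h1 = np.maximum(6 - np.abs(i - 11), 0)     # triangular base waves
    h2 = np.maximum(6 - np.abs(i - 7), 0)
    h3 = np.maximum(6 - np.abs(i - 15), 0)
    pairs = [(h1, h2), (h1, h3), (h2, h3)]     # wave pairs for classes 1, 2, 3
    X, y = [], []
    for label, (ba, bb) in enumerate(pairs, start=1):
        u = rng.uniform(size=(n_per_class, 1))
        signal = u * ba + (1 - u) * bb + rng.normal(size=(n_per_class, 21))
        noise = rng.normal(size=(n_per_class, 19))   # 19 noise-only attributes
        X.append(np.hstack([signal, noise]))
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)

X, y = make_wave40()
print(X.shape, np.bincount(y)[1:])   # (900, 40) [300 300 300]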

Discussion

Our concept formulation in Section 3 is largely inspired by the work of Cook et al. on sufficient dimension reduction for regression [6], [40]. The rigorous statistical language they used allows us to treat the sufficient dimension reduction problem for classification in a coherent way. We expect our concept formulation to serve as a good starting point for further investigations of parallelism in estimation methodologies between these two similar problems. For example, using SVM–DBA to

Conclusion

We formulate the concept of sufficient dimension reduction for classification in terms parallel to those used for regression. A new method is proposed to estimate the IDS, the smallest sufficient discriminative subspace for a given classification problem. The main idea is to combine DBA with SVM in order to overcome the difficulty of DBA in small sample size situations, and at the same time keep the simplicity of DBA with regard to IDS estimation. It also achieves a significant gain in both estimation accuracy

References (44)

  • R. Lotlikar et al., Adaptive linear dimensionality reduction for classification, Pattern Recognition (2000).
  • N. Kumar et al., Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition, Speech Commun. (1998).
  • A.K. Jain et al., Algorithms for Clustering Data (1988).
  • J.B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, Science (2000).
  • S.T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, Science (2000).
  • K.-C. Li, Sliced inverse regression for dimension reduction, J. Am. Stat. Assoc. (1991).
  • K.-C. Li, On principal Hessian directions for data visualization and dimension reduction: another application of Stein's lemma, J. Am. Stat. Assoc. (1992).
  • R. Cook, Regression Graphics: Ideas for Studying Regressions through Graphics (1998).
  • L. Buturovic, Toward Bayes-optimal linear dimension reduction, IEEE Trans. Pattern Anal. Mach. Intell. (1994).
  • K. Fukunaga, Introduction to Statistical Pattern Recognition (1990).
  • A. Jain et al., Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. (2000).
  • R. Chengalvarayan et al., HMM-based speech recognition using state-dependent, discriminatively derived transforms on Mel-warped DFT features, IEEE Trans. Speech Audio Process. (1997).
  • M.J.F. Gales, Maximum likelihood multiple subspace projections for hidden Markov models, IEEE Trans. Speech Audio Process. (2002).
  • B. Schölkopf et al., Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. (1998).
  • S. Mika, G. Rätsch, J. Weston, B. Schölkopf, K.R. Müller, Fisher discriminant analysis with kernels, in: Neural...
  • M. Aladjem, Nonparametric discriminant analysis via recursive optimization of Patrick–Fisher distance, IEEE Trans. Syst. Man Cybern. B (1998).
  • A. Biem et al., Pattern recognition using discriminative feature extraction, IEEE Trans. Signal Process. (1997).
  • G. Saon, M. Padmanabhan, Minimum Bayes error feature selection for continuous speech recognition, in: Proceedings of...
  • K. Torkkola, Learning discriminative feature transforms to low dimensions in low dimensions, in: Proceedings of NIPS...
  • C. Lee et al., Feature extraction based on decision boundaries, IEEE Trans. Pattern Anal. Mach. Intell. (1993).
  • L. Jimenez et al., Hyperspectral data analysis and supervised feature reduction via projection pursuit, IEEE Trans. Geosci. Remote Sensing (1999).
  • M. Hristache et al., Structure adaptive approach for dimension reduction, Ann. Stat. (2001).

About the Author—JIAYONG ZHANG received the B.E. and M.S. degrees in Electronic Engineering from Tsinghua University in 1998 and 2001, respectively. He is currently a Ph.D. candidate in the Robotics Institute, Carnegie Mellon University. His research interests include computer vision, pattern recognition, image processing, machine learning, human motion analysis, character recognition and medical applications.

About the Author—YANXI LIU is a faculty member (associate research professor) affiliated with both the Robotics Institute (RI) and the Center for Automated Learning and Discovery (CALD) of Carnegie Mellon University (CMU). She received her Ph.D. in Computer Science from the University of Massachusetts, where she studied applications of group theory in robotics. Her postdoctoral training was at LIFIA/IMAG (now INRIA) in Grenoble, France. She also received an NSF fellowship from DIMACS (the NSF Center for Discrete Mathematics and Theoretical Computer Science). Her research interests include discriminative subspace induction in large biomedical image databases and computational symmetry in robotics, computer vision and computer graphics.

This research was supported in part by NIH award N01-CO-07119 and PA-DOH grant ME01-738.
