
Pattern Recognition Letters

Volume 30, Issue 16, 1 December 2009, Pages 1516-1522

Info-margin maximization for feature extraction

https://doi.org/10.1016/j.patrec.2009.08.006

Abstract

We propose a novel linear feature extraction method based on info-margin maximization (InfoMargin) from an information-theoretic viewpoint. It aims to achieve a low generalization error by maximizing the information divergence between the distributions of different classes while minimizing the entropy of the distribution within each class. We estimate the density of the data in each class with a Gaussian-kernel Parzen window and develop an efficient, fast-converging algorithm based on quadratic entropy and a quadratic divergence measure. Experimental results show that our method outperforms traditional feature extraction methods on classification and data visualization tasks.

Introduction

In the pattern recognition field, raw input data, such as images or documents, often have very high dimensionality and a limited number of samples, so we face the "curse of dimensionality" problem. Feature extraction (Fukunaga, 1990, Friedman, 1987) is a dimensionality reduction approach that maps high-dimensional data to a low-dimensional space for classification or visualization. Principal component analysis (PCA) and linear discriminant analysis (LDA) are two of the most popular linear feature extraction methods in the literature (Fukunaga, 1990).

PCA (Fukunaga, 1990) aims to find a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data, so it is very useful for reducing noise in the data.
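
For concreteness, the following is a minimal sketch of the standard PCA computation (eigendecomposition of the sample covariance matrix); it illustrates the textbook formulation only, not anything specific to this paper.

import numpy as np

def pca(X, k):
    # X: (n_samples, n_features); returns an (n_features, k) projection matrix
    Xc = X - X.mean(axis=0)                 # center the data
    cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :k]          # top-k directions of maximum variance

# usage: project 10-dimensional data onto its two principal components
X = np.random.randn(100, 10)
Y = (X - X.mean(axis=0)) @ pca(X, 2)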

LDA (Fukunaga, 1990) derives a discriminative transformation that maximizes between-class scatter while minimizing within-class scatter. However, LDA uses only second-order statistical information (covariances), so it is optimal for data in which each class has a uni-modal Gaussian density and the classes do not share the same mean (Fukunaga, 1990). When the class-conditional densities are multi-modal, LDA does not work well. LDA also fails when the class separability cannot be represented by the between-class scatter matrix Sb; in particular, when all classes share the same mean, it cannot find any discriminant direction because Sb=0 (Fukunaga, 1990). Nonparametric discriminant analysis (Fukunaga and Mantock, 1983) was proposed to remedy this drawback by computing Sb nonparametrically, but it lacks a global view of the data distributions. Another drawback of LDA is the so-called "small sample size (SSS)" problem: the within-class scatter matrix becomes singular when dealing with high-dimensional data. PCA + LDA (Belhumeur et al., 1997) and null space-based LDA (NLDA) (Chen et al., 2000) are two effective approaches to this problem. Although these methods improve on basic LDA, they are still based solely on second-order statistics, so they may not work well when the distributions are non-Gaussian, as in many practical cases.
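
As a reference point, the sketch below shows the standard Fisher LDA transform obtained from the generalized eigenproblem Sb w = lambda Sw w (again the textbook formulation, not this paper's method); when Sw is singular, the SSS problem mentioned above appears.

import numpy as np
from scipy.linalg import eigh

def lda(X, labels, k):
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # generalized eigenproblem Sb w = lambda Sw w; fails if Sw is singular (SSS problem)
    eigvals, eigvecs = eigh(Sb, Sw)
    return eigvecs[:, ::-1][:, :k]   # at most (n_classes - 1) useful directions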

Other works use higher-order statistics for supervised feature extraction (Linsker, 1988, Bollacker and Ghosh, 1996, Principe et al., 2000, Vasconcelos, 2002, Lyu, 2005, Hild et al., 2006) by maximizing the mutual information (MI) between the extracted features and the class label.

To clarify these works and motivate ours, we first give some brief definitions. Assume that a random variable y is drawn from the distribution p(y), and that a discrete-valued random variable c representing its class label is drawn from P(c). Using Shannon's definition (Cover and Thomas, 1991), the entropies of c and y are expressed in terms of the probabilities P(c) and p(y):

H(c) = -\sum_{c} P(c) \log P(c), \qquad H(y) = -\int p(y) \log p(y) \, dy.

After having observed a feature vector y, the uncertainty of the class label c can be defined as the conditional entropy

H(c \mid y) = -\sum_{c} \int p(c, y) \log p(c \mid y) \, dy.

The mutual information (MI) I(y, c) between y and c can be written as

I(y, c) = \sum_{c} \int p(c, y) \log \frac{p(c, y)}{P(c)\, p(y)} \, dy = H(c) - H(c \mid y) = H(y) - H(y \mid c).
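
To make these quantities concrete, here is a small histogram-based estimate of I(y; c) for a one-dimensional feature, the kind of estimator the next paragraph notes only scales to two or three variables; the data are synthetic and purely illustrative.

import numpy as np

def mutual_information(y, c, n_bins=20):
    # I(y; c) = H(y) - H(y | c), estimated from histograms (result in nats)
    edges = np.histogram_bin_edges(y, bins=n_bins)

    def entropy(values):
        counts, _ = np.histogram(values, bins=edges)
        p = counts / counts.sum()
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    H_y = entropy(y)
    H_y_given_c = sum(np.mean(c == label) * entropy(y[c == label])
                      for label in np.unique(c))
    return H_y - H_y_given_c

# two well-separated classes: the estimate approaches H(c) = log 2
y = np.concatenate([np.random.randn(500) - 3, np.random.randn(500) + 3])
c = np.repeat([0, 1], 500)
print(mutual_information(y, c))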

Mutual information maximization is a powerful feature extraction criterion, and on the training samples it is optimal in the sense of minimum Bayes error (Vasconcelos, 2002). However, it is still not widely used because of its computational difficulties, especially for high-dimensional data. Although histogram-based MI estimation works with two or three variables, it fails in higher dimensions. Torkkola (2003) proposes an effective MI-based feature extraction method that replaces Shannon entropy with Renyi entropy, which greatly improves computational efficiency.
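
The computational advantage comes from the fact that the quadratic (Renyi) entropy of a Gaussian-kernel Parzen estimate has a closed form as a double sum over sample pairs (the "information potential" of Principe et al., 2000). A brief sketch of that standard estimator, assuming an isotropic kernel of width sigma, follows.

import numpy as np

def renyi_quadratic_entropy(Y, sigma=1.0):
    # H2(y) = -log integral p(y)^2 dy; with a Parzen estimate built from Gaussian
    # kernels of variance sigma^2, the integral reduces to the average of Gaussians
    # with variance 2*sigma^2 evaluated at all pairwise sample differences.
    n, d = Y.shape
    var = 2.0 * sigma ** 2
    sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # (n, n)
    kernels = (2 * np.pi * var) ** (-d / 2) * np.exp(-sq_dists / (2 * var))
    information_potential = kernels.sum() / n ** 2
    return -np.log(information_potential)

# usage: entropy of 200 two-dimensional samples
print(renyi_quadratic_entropy(np.random.randn(200, 2), sigma=0.5))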

Although MI-based methods have achieved good performance in feature extraction, they suffer from the following drawbacks: (1) they rely on density estimation in the transformed subspace, which is computationally expensive and not robust when the dimensionality of the subspace is high; (2) as Eq. (5) shows, MI-based methods cannot distinguish among feature candidates that have the same mutual information with the class label. In particular, once the classes are separated in the extracted feature space (H(c|y)=0), MI-based methods will not continue to search for a better feature space (see Fig. 1), so they cannot guarantee a low generalization error on unseen samples; (3) Eq. (6) does not make explicit what role H(y|c) plays in maximizing the mutual information. Since we expect samples of the same class to be close to each other, H(y|c) should be as low as possible, but MI-based methods cannot guarantee this.

Margin maximization (Vapnik, 1995) is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it gives a clear geometric interpretation of the models being built. A margin is essentially the divergence between classes under some measure, such as Euclidean distance or hypothesis distance (Crammer et al., 2002). A margin-maximizing loss function in this sense is useful for building good prediction models (Rosset et al., 2003). An excellent example of margin maximization is the support vector machine (SVM) (Vapnik, 1995), which defines the margin as the distance between an instance and the decision boundary.

Fig. 1 shows that the mutual information between the projected samples and the class labels is the same for the two projection directions P1 and P2. However, from the SVM viewpoint the projection direction P1 has a larger margin than P2. Once mutual information maximization methods, such as Torkkola (2003), find the projection P2, they cannot continue to search for the optimal projection P1, because P2 has already reached a local maximum of the mutual information.

In this paper, we define a novel margin (info-margin) from an information-theoretic view and propose a linear feature extraction method (InfoMargin) that maximizes the defined info-margin. It aims to achieve a low generalization error by maximizing the information divergence between the distributions of different classes while minimizing the entropy of the distribution within each class. We estimate the density of each class with a Gaussian-kernel Parzen window and give an efficient algorithm based on quadratic entropy and a quadratic divergence measure, which avoids histogram-based estimation of the class densities.
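
As an illustration only, the sketch below assembles one plausible objective of this kind from quadratic Parzen quantities: a Cauchy-Schwarz-type divergence between every pair of class densities, minus the quadratic entropy of each class. This is a hypothetical reading of the criterion described above, not the paper's exact formulation, and the helper names are ours.

import numpy as np

def cross_information_potential(A, B, sigma):
    # approximates integral p_A(y) p_B(y) dy for Gaussian-kernel Parzen estimates
    d = A.shape[1]
    var = 2.0 * sigma ** 2
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return ((2 * np.pi * var) ** (-d / 2) * np.exp(-sq / (2 * var))).mean()

def info_margin_score(Y, labels, sigma=1.0):
    classes = list(np.unique(labels))
    score = 0.0
    # between-class term: Cauchy-Schwarz divergence, larger when classes overlap less
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            Va = cross_information_potential(Y[labels == a], Y[labels == a], sigma)
            Vb = cross_information_potential(Y[labels == b], Y[labels == b], sigma)
            Vab = cross_information_potential(Y[labels == a], Y[labels == b], sigma)
            score += -np.log(Vab ** 2 / (Va * Vb))
    # within-class term: subtract each class's quadratic entropy H2 = -log V
    for a in classes:
        Va = cross_information_potential(Y[labels == a], Y[labels == a], sigma)
        score -= -np.log(Va)
    return score

A linear projection W could then be sought by maximizing info_margin_score(X @ W, labels) over W; the paper itself reports an efficient, fast-converging algorithm for this optimization.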

The rest of the paper is organized as follows. Section 2 gives a brief review of information theory, in particular the generalized definitions of entropy and divergence measures. Section 3 proposes the info-margin criterion and the info-margin maximization algorithm. We then show the experimental results in Section 4 and conclude in Section 5.

Section snippets

Entropy, divergence measure, and density estimation

In this section, we give a brief introduction to the definitions of entropy, divergence measures and density estimation.
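
As a reference for the density-estimation part, here is a minimal sketch of the Gaussian-kernel Parzen window estimator in its standard form; the isotropic kernel of width sigma is an assumption made for simplicity.

import numpy as np

def parzen_density(x, samples, sigma=1.0):
    # p(x) is estimated as the average of Gaussian kernels centered at the samples
    d = samples.shape[1]
    sq = ((samples - x) ** 2).sum(axis=1)
    norm = (2 * np.pi * sigma ** 2) ** (-d / 2)
    return np.mean(norm * np.exp(-sq / (2 * sigma ** 2)))

# usage: estimated density at the origin for 200 samples from a 2-D Gaussian
samples = np.random.randn(200, 2)
print(parzen_density(np.zeros(2), samples, sigma=0.5))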

Info-margin maximization

To classify samples, we need a divergence measure to quantify the divergence between classes. We would hope that a pattern is close to those in the same class but far from those in different classes. Therefore, a good feature extractor should maximize the divergence between classes while minimizing the distances between samples of the same class after the transformation. In this section, we reformulate this idea from an information-theoretic view and give an info-margin maximization criterion

Experiments

In this section, we apply info-margin maximization (InfoMargin) to real-world data and compare it with other popular linear feature extraction methods: PCA (Turk and Pentland, 1991), PCA + LDA (Belhumeur et al., 1997), nonparametric discriminant analysis (NDA) (Fukunaga, 1990), nonlinear component analysis (Schölkopf et al., 1998), and mutual information maximization (MMI) (Torkkola, 2003). Since MMI is not robust when the dimensionality of the transformed subspace is high, we use

Conclusions

In this paper, we proposed a new feature extraction method, the info-margin maximization criterion (InfoMargin), which finds important discriminant directions from an information-theoretic viewpoint and deals well with non-Gaussian class distributions. We use quadratic entropy and a quadratic divergence measure with a Gaussian-kernel Parzen density estimator, which greatly enhances computational efficiency. Our experiments show that our method is efficient and robust. In future work, we will

References (24)

  • Chen, L., et al., 2000. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recogn.
  • Asuncion, A., Newman, D., 2007. UCI machine learning repository. URL:...
  • Belhumeur, P., et al., 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell.
  • Bian, Z., et al., 1999. Pattern Recognition.
  • Bollacker, K., Ghosh, J., 1996. Linear feature extractors based on mutual information. In: Proc. 13th ICPR, pp....
  • Cover, T.M., Thomas, J.A., 1991. Elements of Information Theory.
  • Crammer, K., Gilad-Bachrach, R., Navot, A., Tishby, N., 2002. Margin analysis of the LVQ algorithm. In: Proc. Neural...
  • Friedman, J.H., 1987. Exploratory projection pursuit. J. Amer. Statist. Assoc.
  • Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition.
  • Fukunaga, K., Mantock, J.M., 1983. Nonparametric discriminant analysis. IEEE Trans. Pattern Anal. Machine Intell.
  • Graham, D.B., Allinson, N.M., 1998. Characterizing virtual eigensignatures for general purpose face recognition. (in)...
  • Hild, K.E., et al., 2006. Feature extraction using information-theoretic learning. IEEE Trans. Pattern Anal. Machine Intell.
This work was (partially) funded by Chinese NSF 60673038, Doctoral Fund of Ministry of Education of China 200802460066, and Shanghai Science and Technology Development Funds 08511500302.
