Elsevier

Pattern Recognition

Volume 45, Issue 3, March 2012, Pages 1136-1145

Extract minimum positive and maximum negative features for imbalanced binary classification

https://doi.org/10.1016/j.patcog.2011.09.004

Abstract

In an imbalanced dataset, the positive and negative classes can be quite different in both size and distribution. This degrades the performance of many feature extraction methods and classifiers. This paper proposes a method for extracting minimum positive and maximum negative features (in terms of absolute value) for imbalanced binary classification, and develops two models to yield the feature extractors. Model 1 first generates a set of candidate extractors that drive the positive features to zero, and then chooses from these candidates the ones that maximize the negative features. Model 2 first generates a set of candidate extractors that maximize the negative features, and then chooses the ones that minimize the positive features. Compared with traditional feature extraction methods and classifiers, the proposed models are less likely to be affected by the imbalance of the dataset. Experimental results show that these models perform well when the positive and negative classes are imbalanced in both size and distribution.

Highlights

► We present a method to extract minimum positive and maximum negative features.
► We design two models and algorithms to generate feature extractors.
► The proposed method performs well on imbalanced datasets.

Introduction

As one of the fundamental problems in machine learning, learning from imbalanced datasets has attracted much attention in recent years [1], [2]. In this paper, unless otherwise specified, we limit our study to the imbalanced binary classification problem. The imbalance takes at least two forms. One is an imbalance in size, where one class has many more samples than the other. The other is an imbalance in distribution, where the distributions of the two classes are quite different. A typical imbalanced binary classification problem is the task of verification. In this task, the positive class consists of the representations of one object and the negative class consists of anything else. It is an imbalanced problem because (1) the positive class has fewer samples than the negative class; and (2) the positive samples (representations of one object) form a cluster, while the negative samples (which can be anything that differs from the positive samples) do not.

Imbalanced data degrade the performance of many dimension reduction and feature extraction methods. When presented with imbalanced datasets, some methods, such as principal component analysis (PCA) [3], tend to yield feature extractors that favor the majority class. The unsupervised PCA seeks the feature extractors that maximize the total scatter; its feature extractors will be largely determined by the majority class if one class has many more samples than the other. Other feature extraction methods, such as Fisher discriminant analysis (FDA) [4], [5], cannot perform well on imbalanced datasets because they are essentially developed for balanced datasets. The supervised FDA aims to maximize the between-class scatter and minimize the within-class scatter, and it is developed under the assumption that the samples of both classes follow Gaussian distributions.

Many standard classifiers tend to favor the majority class on imbalanced data. The support vector machine (SVM) refers to the samples near the class boundaries as support vectors and seeks the separating hyperplane that maximizes the margin between the hypothesized concept boundary and the support vectors [1]. SVMs are inherently biased toward the majority class because they aim to minimize the total error. The multilayer perceptron (MLP) has been shown to have difficulty learning from imbalanced datasets [6]. Because of their ability to avoid the so-called overfitting, simple and robust linear classifiers, such as linear discriminant analysis (LDA), minimum square error (MSE), and the linear SVM, are attractive [7]. These classifiers make the implicit assumption that the positive and negative classes can be roughly separated by a hyperplane [8]. However, this assumption is violated in many imbalanced datasets where only the positive samples form a cluster, as detailed in Section 2. This explains why the performance of these linear classifiers is significantly degraded on imbalanced datasets.

Different from the discriminative methods (LDA, MSE, and SVM), the Gaussian mixture model (GMM) [9] is a generative method. In a GMM, the distribution of the samples is modeled by a linear combination of two or more Gaussian distributions [10], [11], [12]. GMMs have been used in many fields [10], [11], [12], and they can handle the imbalanced problem if the parameters of the Gaussian distributions are properly estimated. The main difficulty in GMM is estimating the number of Gaussians to use [13].

This paper proposes a method for imbalanced binary classification. The proposed method seeks feature extractors that generate minimum positive and maximum negative features in terms of absolute value. In other words, the positive features extracted by a feature extractor are expected to lie in an interval [−ξ, ξ], while the negative features fall into (−∞, −ξ) ∪ (ξ, +∞), where ξ is a positive scalar. This agrees with the situation in a verification task, where the positive samples cluster together and the negative samples may not. To obtain the feature extractors, this paper proposes two models and designs algorithms to solve them. While model 1 first minimizes the positive features and then maximizes the negative features, model 2 first maximizes the negative features and then minimizes the positive features. After projecting the samples onto the feature extractors, the proposed method classifies the features based on their weighted distances to the origin, as sketched below.
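To make this decision rule concrete, the following is a minimal Python sketch. It assumes a single learned extractor w that already includes the bias as an extra dimension (see Section 3) and a threshold xi; the multi-extractor variant is one plausible reading of the weighted-distance rule, not necessarily the paper's exact formulation.

    import numpy as np

    def classify(x, w, xi):
        # A sample is positive when its feature w^T x falls inside
        # the band [-xi, xi], and negative otherwise. The sample is
        # augmented with a fixed 1 so that w can absorb the bias.
        x_aug = np.append(x, 1.0)
        feature = w @ x_aug
        return 1 if abs(feature) <= xi else -1

    def classify_weighted(x, W, weights, xi):
        # Hypothetical combination rule for several extractors (the
        # rows of W): threshold the weighted sum of the features'
        # distances to the origin.
        x_aug = np.append(x, 1.0)
        features = W @ x_aug
        score = weights @ np.abs(features)
        return 1 if score <= xi else -1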

The advantages of the proposed method are mainly summarized as follows:

Firstly, the proposed method is less likely to be affected by the imbalanced distributions of the positive and negative classes, in two respects. Different from the traditional feature extraction methods that assume both the positive and negative samples cluster together, the proposed method only requires that the positive samples cluster together (the negative samples may or may not). Different from the traditional linear classifiers that require the samples to be roughly separable by a single hyperplane, the proposed method can perform well as long as two parallel hyperplanes can separate the positive samples from the negative ones.

Secondly, the proposed method is less likely to be affected by the imbalanced sizes of the positive and negative classes. The positive and negative samples are fed independently into the two steps of the proposed algorithms. Thus, the two classes have equal power in determining the final feature extractors, even though one class may contain many more samples than the other.

Thirdly, the proposed method significantly reduces the misclassification of outliers into the positive class. Different from traditional methods that assign two symmetric half-spaces to the positive and negative classes, our method assigns two asymmetric regions to the two classes. As the region for the positive class is much smaller than that for the negative class, outliers are unlikely to be misclassified into the positive class, as the toy comparison below illustrates.
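The following toy Monte Carlo comparison makes this point concrete. The extractor, band half-width, and outlier distribution are assumptions chosen purely for illustration: a random far-away outlier lands in a half-space positive region about half the time, but in a narrow band almost never.

    import numpy as np

    rng = np.random.default_rng(1)
    w = np.array([1.0, 0.0])   # assumed unit-norm extractor (bias omitted for simplicity)
    xi = 0.5                   # assumed band half-width

    # Draw outliers uniformly from a large box around the data.
    outliers = rng.uniform(-100, 100, size=(100_000, 2))

    band_rate = np.mean(np.abs(outliers @ w) <= xi)  # band assigned to the positive class
    half_rate = np.mean(outliers @ w >= 0)           # half-space assigned to the positive class

    print(f"outliers falling in the band:       {band_rate:.4f}")   # about 0.005 (= xi/100)
    print(f"outliers falling in the half-space: {half_rate:.4f}")   # about 0.5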

The rest of this paper is organized as follows. Section 2 describes the background and motivation. Section 3 presents the proposed method. Section 4 presents the experiments, and Section 5 concludes the paper.

Section snippets

Background and motivation

We consider a binary classification problem, where the $d$-dimensional column vectors $x_1, x_2, \ldots, x_{l_1}$ are samples from the positive class with class labels $y_i = 1$ $(1 \le i \le l_1)$, and $x_{l_1+1}, x_{l_1+2}, \ldots, x_{l_1+l_2}$ are samples from the negative class with class labels $y_i = -1$ $(l_1+1 \le i \le l)$. The total number of samples is $l = l_1 + l_2$. We denote the matrix consisting of all the training samples by $X = [x_1 \; x_2 \; \cdots \; x_l]$, and the vector consisting of all the class labels by $Y = [y_1 \; y_2 \; \cdots \; y_l]^T$, as illustrated below.
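As a concrete illustration of this notation, the following Python sketch builds X and Y for a small synthetic imbalanced dataset; the dimensions, class sizes, and distributions are illustrative assumptions, not the paper's data.

    import numpy as np

    # Illustrative sizes: d = 5 dimensions, l1 = 10 positive and
    # l2 = 200 negative samples, so l = l1 + l2 = 210.
    d, l1, l2 = 5, 10, 200
    rng = np.random.default_rng(0)

    pos = rng.normal(0.0, 0.1, size=(d, l1))  # positives cluster tightly
    neg = rng.normal(0.0, 2.0, size=(d, l2))  # negatives are widely spread

    X = np.hstack([pos, neg])                        # X = [x1 x2 ... xl], shape (d, l)
    Y = np.concatenate([np.ones(l1), -np.ones(l2)])  # Y = [y1 y2 ... yl]^T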

Imbalanced datasets degrade the performance of many feature

Proposed method

In this section, we propose a new method for imbalanced binary classification. For simplicity, we assume each sample includes an extra dimension with the fixed value 1, so that the threshold w0 (in Eq. (1)) becomes an additional dimension of the coefficient vector. Also, as only the direction of the coefficient vector w matters for the classification, we restrict it to have unit norm. This coefficient vector is also referred to as the feature extractor; both conventions are sketched below.
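The two conventions just described are easy to implement. The following is a minimal Python sketch of this preprocessing only, under the stated conventions; it is not the paper's full algorithm.

    import numpy as np

    def augment(X):
        # Append the fixed value 1 as an extra dimension to every
        # sample (column of X), so the threshold w0 folds into the
        # coefficient vector w.
        return np.vstack([X, np.ones(X.shape[1])])

    def normalize(w):
        # Only the direction of the coefficient vector matters for
        # classification, so restrict it to unit norm.
        return w / np.linalg.norm(w)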

Section 3.1 introduces the basic idea of

Experiments

In this section, we first compare our method with different classifiers (back propagation, GMM, and five different forms of SVM) on two synthetic datasets in Section 4.1 and on object verification in Section 4.2. Then, we compare our method with different feature extraction methods (FDA, PCA, and LPP) on face verification in Section 4.3. The experimental results validate the feasibility of the proposed method.

Conclusion

This paper proposes a method for extracting minimum positive and maximum negative features (in terms of absolute value) for imbalanced binary classification. Corresponding to each feature extractor is a pair of parallel hyperplanes that separate the positive samples from the negative ones, as shown in Fig. 3. This differentiates our method from the traditional linear classifiers that try to separate the samples using a single hyperplane. To obtain the feature extractors, this paper presents two

Acknowledgment

The authors are most grateful for the constructive advice on the revision of the manuscript from the anonymous reviewers. The funding support from Hong Kong Government under its GRF scheme (5341/08E and 5366/09E) and the research grant from Hong Kong Polytechnic University (1-ZV5U) are greatly appreciated.

References (30)

  • J. Yang et al., What's wrong with Fisher criterion?, Pattern Recognition (2002)
  • D.A. Reynolds, Speaker identification and verification using Gaussian mixture speaker models, Speech Communication (1995)
  • H. He et al., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering (2009)
  • T.M. Khoshgoftaar et al., Supervised neural network modeling: an empirical investigation into learning from imbalanced data with labeling errors, IEEE Transactions on Neural Networks (2010)
  • M. Kirby et al., Application of the Karhunen–Loeve procedure for the characterization of human faces, IEEE Transactions on Pattern Analysis and Machine Intelligence (1990)
  • Y. Xu et al., A novel method for Fisher discriminant analysis, Pattern Recognition (2004)
  • Y.L. Murphey et al., Neural learning from imbalanced data, Applied Intelligence, special issue on Neural Networks and Applications (2004)
  • W. Chen et al., A novel hybrid linear/nonlinear classifier for two-class classification: theory, algorithm and applications, IEEE Transactions on Medical Imaging (2010)
  • R.O. Duda et al., Pattern Classification (2001)
  • B. Scherrer, Gaussian Mixture Model Classifiers, 2007. Available online at...
  • P. Bansal et al., Improved hybrid model of HMM/GMM for speech recognition, Intelligent Information and Engineering Systems (2008)
  • D.A. Reynolds et al., Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing (1995)
  • J.V.B. Soares et al., Segmentation of retinal vasculature using wavelets and supervised classification: theory and implementation
  • R. Bellman, Adaptive Control Processes: A Guided Tour (1961)
  • T. Zhang et al., Discriminative orthogonal neighborhood-preserving projections for classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (2010)
Jinghua Wang received his B.S. degree in Computer Science from Shandong University and his M.S. degree from the Harbin Institute of Technology. He is currently a Ph.D. candidate in the Department of Computing, The Hong Kong Polytechnic University. His current research interests are in the areas of pattern recognition and image processing.

Jane You obtained her B.Eng. in Electronic Engineering from Xi'an Jiaotong University in 1986 and her Ph.D. in Computer Science from La Trobe University, Australia, in 1992. She was a lecturer at the University of South Australia and a senior lecturer at Griffith University from 1993 to 2002. Currently she is a professor at The Hong Kong Polytechnic University. Her research interests include image processing, pattern recognition, medical imaging, biometrics computing, multimedia systems, and data mining.

Qin Li received his B.Eng. degree in computer science from the China University of Geosciences, the M.Sc. degree (with distinction) in computing from the University of Northumbria at Newcastle, and the Ph.D. degree from The Hong Kong Polytechnic University. His current research interests include medical image analysis, biometrics, image processing, and pattern recognition.

Yong Xu was born in Sichuan, China, in 1972. He received his B.S. and M.S. degrees in 1994 and 1997, respectively, and the Ph.D. degree in Pattern Recognition and Intelligence System from NUST (China) in 2005. He now works at the Shenzhen Graduate School, Harbin Institute of Technology. His current interests include feature extraction, biometrics, face recognition, machine learning, and image processing.
