Elsevier

Pattern Recognition

Volume 47, Issue 2, February 2014, Pages 789-805

Integrated Fisher linear discriminants: An empirical study

https://doi.org/10.1016/j.patcog.2013.07.021

Highlights

  • An optimal threshold is selected, on the basis of classification accuracy, from a series of empirical threshold formulas developed for Fisher linear discriminants.

  • Weight vectors and thresholds are updated by an epoch-limited iterative learning strategy.

  • Singular or nearly singular within-class scatter matrices are reduced in dimensionality rather than perturbed.

  • A coding system enlarges class margins and approximately preserves neighborhood relationships.

  • An integrated learning algorithm improves the learning and generalization performances of Fisher linear discriminants.

Abstract

This paper studies Fisher linear discriminants (FLDs) in terms of classification accuracy on imbalanced datasets. An optimal threshold is selected from a series of empirical formulas we develop; it depends not only on the sample sizes but also on the distribution regions. A mixed binary–decimal coding system is suggested to make very dense datasets sparse and to enlarge the class margins, provided that the neighborhood relationships of the samples are nearly preserved. Within-class scatter matrices that are singular or nearly singular should be moderately reduced in dimensionality rather than perturbed by tiny amounts. The weight vectors can be further updated by an epoch-limited (at most three epochs) iterative learning strategy, provided that the training error rates decrease accordingly. Putting the above ideas together, this paper proposes a type of integrated FLD. Extensive experimental results on real-world datasets demonstrate that the integrated FLDs have clear advantages over conventional FLDs in both learning and generalization performance on imbalanced datasets.

Introduction

Linear classifiers are basic pattern recognition models, mainly including classical Fisher linear discriminants (FLDs) [1], [2], [3], [4], [5], single-layer perceptrons (SLPs) [6], [7], [8] and linear support vector machines (SVMs) [9], [10], [11]. Because of their simple structure and low computational cost, FLDs are among the most popular linear classifiers in many applications [12], [13], [14]. The key to the success of FLDs seems to lie in the fact that the resulting linear decision hyperplanes usually offer reasonable partitions. However, a major issue with FLDs is that the parameters, namely the mean vectors and within-class scatter matrices, have to be estimated from a limited number of available training samples in order to determine the weight vectors and thresholds [5], [15], [16]. Although the parameter estimates associated with Gaussian distributions have good statistical properties, Gaussian assumptions do not suit all distributions [17], [18].

The criterion function J(·), which maximizes the ratio of between-class scatter to within-class scatter in order to find a proper weight vector, is well known and is called Fisher's criterion [15]. To date, many criterion functions have been presented [11], [16], [17], [19]. In a sense, the diverse quadratic optimization functions used in SVMs are a natural generalization of Fisher's criterion [20], [21]. In fact, FLDs are consistent with SVMs in their aim of maximizing the class margins by constructing optimal separating hyperplanes or projection directions [22].
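For concreteness, Fisher's criterion takes its standard textbook form J(w) = (w^T S_B w)/(w^T S_W w), where S_B = (μ1 − μ2)(μ1 − μ2)^T is the between-class scatter matrix and S_W is the pooled within-class scatter matrix; up to scale, the maximizer is w ∝ S_W^−1(μ1 − μ2).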

Imbalance is absolute, while balance is relative. Increasing attention has been paid to imbalanced datasets [23], [24], [25], [26], [27], [28], most of it centered on the imbalance of sample sizes for the sake of intuitiveness and simplicity. When learning from an imbalanced dataset, a classifier tends to assign samples to the majority class [29]. In other words, training is prone to produce a classifier with a strong bias toward the majority class, resulting in a large number of false negatives.

Re-sampling techniques are popular for imbalanced problems: the minority classes are over-sampled, the majority classes are under-sampled, or some combination of the two is employed [25], [28], [30], [31], [32]. Meanwhile, boosting and bagging algorithms have been developed for successively training component classifiers, known as cost-sensitive classifiers [10], [33]. They work by assigning larger weights to misclassified samples and smaller weights to the rest. AdaBoost, a variant of boosting, is the most popular of these [34]. Such algorithms can handle imbalanced cases to some extent.

Is an FLD able to solve a linearly separable problem? The answer is “yes” for a “balanced” dataset. However, when a two-class dataset is seriously imbalanced, the FLD may fail; the more serious the imbalance, the more poorly the resulting FLD performs. The imbalance of distribution regions usually has a far greater influence on classifier performance than the imbalance of sample sizes does [35]. It is therefore closer to real situations to consider the imbalance of distribution regions, e.g., of variances [5], [10], [16], [18]. Indeed, scatters, as manifestations of variances, are already included in J(·). However, only the sum of the two within-class scatter matrices is used, as the denominator of J(·). In other words, J(·) does not reflect the difference between the within-class scatters, i.e., the imbalance of distribution regions.

It is the threshold of an FLD that ultimately determines the location of the separating hyperplane. Based on the above observations, the first question we have to answer is: “Can the classification accuracy of an FLD be improved by selecting a proper threshold?”

The weight vector w of an FLD for a two-class problem {ω1, ω2} is determined only by the within-class scatter matrix SW and the two mean vectors μ1 and μ2. If SW is singular or nearly singular, the FLD no longer works. To address this issue, a tiny perturbation matrix is often added to SW [1], [12], [36], [34]. However, it is well known that (A+B)^−1 ≠ A^−1 + B^−1 in general. Therefore, the second question we need to answer is: “Is it feasible to add a tiny perturbation to a nearly singular matrix SW?”
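As a minimal numerical sketch (not the paper's procedure), the following NumPy fragment contrasts the ridge-style perturbation SW + εI with an eigenvalue-based dimensionality reduction when SW is nearly singular; the toy data, the perturbation size and the 1e-6 eigenvalue cut-off are all illustrative assumptions:

import numpy as np

# Toy two-class data whose two attributes are nearly linearly dependent,
# so the pooled within-class scatter matrix S_W is close to singular.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X1 = np.hstack([base, base + rng.normal(scale=1e-6, size=(50, 1))])
X2 = X1 + np.array([2.0, 2.0])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
print("cond(S_W) =", np.linalg.cond(S_W))            # huge -> nearly singular

# (a) Tiny perturbation: w = (S_W + eps*I)^-1 (mu1 - mu2).
eps = 1e-8
w_pert = np.linalg.solve(S_W + eps * np.eye(2), mu1 - mu2)

# (b) Dimensionality reduction instead: keep only the eigen-directions of
#     S_W with non-negligible eigenvalues, solve the FLD in that subspace,
#     and map the solution back to the input space.
evals, evecs = np.linalg.eigh(S_W)
keep = evals > 1e-6 * evals.max()                     # illustrative cut-off
P = evecs[:, keep]                                    # m x k projection matrix
w_red = P @ np.linalg.solve(P.T @ S_W @ P, P.T @ (mu1 - mu2))

print("w (perturbed)   :", w_pert / np.linalg.norm(w_pert))
print("w (reduced dim.):", w_red / np.linalg.norm(w_red))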

During learning, large attribute values usually have a greater influence on classifier parameters than small ones do; conversely, at decision time, small class margins usually have a greater influence on generalization performance than large ones do [17], [22], [35]. Unduly large values in a few attributes may make SW close to singular, and unduly small margins increase the difficulty of seeking the optimal separating hyperplane. Both cases degrade the generalization performance of the resulting classifiers. In this sense, feature representation is a crucial step in classifier design, whether the classifier is linear or non-linear.

Decimal (DEC) and binary (BIN) codes are two common feature representation systems. Normalized, proportional, logarithmic and sigmoid transformations are popular equal-dimensional ones. SVMs enlarge class margins by making the original data sparse in a higher-dimensional feature space through nonlinear transformations chosen in advance, e.g., polynomial and radial basis function (RBF) kernels [20], [21]. Borrowing this idea, we can transform data from a lower-dimensional input space to a higher-dimensional feature space directly by coding beforehand. The premise for doing so is that the original information must be preserved as far as possible [19]. Therefore, the third question to be answered is: “How can an effective feature representation system be developed that enlarges the class margins as much as possible, lessens the within-class scatters and moderates unfavorably large components, on condition that the neighborhood relationships are approximately preserved?”
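The paper's mixed BIN–DEC scheme is detailed in Section 4 and is not reproduced in these snippets. Purely as a hypothetical illustration of coding into a higher-dimensional space while roughly preserving neighborhoods, one could combine a unary (“thermometer”) code of the quotient with a decimal remainder, as in the Python sketch below; the function name, base and code length are invented here:

import numpy as np

def unary_decimal_code(x, base=10, levels=8):
    """Hypothetical illustration (not the paper's BIN-DEC scheme): the
    quotient x // base is unary ('thermometer') coded over `levels`
    positions, and the remainder is kept as one scaled decimal component.
    The code is higher-dimensional and spreads dense values over more
    components, while nearby values of x map to nearby codes."""
    q, r = divmod(int(x), base)
    q = min(q, levels)                          # clip to the coding range
    return np.r_[np.ones(q), np.zeros(levels - q), r / base]

print(unary_decimal_code(37))   # [1 1 1 0 0 0 0 0 0.7]
print(unary_decimal_code(38))   # [1 1 1 0 0 0 0 0 0.8]  (close to 37)
print(unary_decimal_code(95))   # [1 1 1 1 1 1 1 1 0.5]  (far from 37)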

Solving for the weight vectors of FLDs is in essence an analytic learning process, which is often faster than the iterative procedures used in neural networks and SVMs [15], [20]. Although it is customary practice, a one-time analytic solution of the weights is not necessarily optimal. Therefore, the fourth question to be answered is: “How can an iterative learning algorithm be developed that alleviates the imbalance and accordingly updates the weights and thresholds by properly selecting a portion of the training samples?”

This paper aims to noticeably improve the classification accuracies of FLDs by answering the above questions. In other words, from a heuristic point of view, it seeks to empirically optimize the weights and thresholds of FLDs toward minimum-error-rate classification in the sense of Bayesian decision theory. The rest of this paper is organized as follows. Section 2 introduces related work on FLDs. Section 3 proposes a series of empirical threshold formulas to alleviate the imbalance. Section 4 describes some mixed feature representation approaches and the condition for carrying out feature extraction by principal component analysis (PCA). Section 5 details the epoch-limited iterative learning strategy for further alleviating the imbalance. Section 6 presents the experimental results. Finally, Section 7 draws our conclusions.

Section snippets

Related work

First of all, let us consider a two-class classification problem {ω1, ω2} as well as the linear discriminant function. For a pattern x = (x1, x2, …, xm)^T ∈ R^m in the m-dimensional input space, the decision hyperplane π can be written as

π: f(x) = w^T x + w0 = w^T x − θ = 0

where w = (w1, w2, …, wm)^T ∈ R^m is the weight vector, often called the normal or projected direction of π, and θ = −w0 is the threshold or bias.

The decision rule is

x ∈ ω1, if w^T x > θ;    x ∈ ω2, if w^T x < θ;    indefinite, if w^T x = θ.
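A minimal NumPy sketch of this discriminant and decision rule follows; the threshold used here is the conventional midpoint of the projected class means, standing in for the empirical formulas of Section 3:

import numpy as np

def fit_fld(X1, X2):
    """Conventional two-class FLD: w ~ S_W^-1 (mu1 - mu2); the threshold is
    the midpoint of the projected class means, one common choice."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
    w = np.linalg.solve(S_W, mu1 - mu2)
    theta = 0.5 * (w @ mu1 + w @ mu2)
    return w, theta

def decide(x, w, theta):
    """Decision rule above: omega_1 if w^T x > theta, omega_2 if w^T x < theta,
    indefinite otherwise."""
    p = w @ x
    return "omega_1" if p > theta else "omega_2" if p < theta else "indefinite"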

Suppose the two input data matrices are

Empirical thresholds

First of all, let us discuss a simple example—Example 1.

A synthetic dataset X only consists of four samples {(0, 0); (1, 0); (0, 1); (1, 1)}, the first 3 of which belong to class ω1, and the remaining one to class ω2. Obviously, the dataset is linearly separable. The detailed computational results are μ1 = (1/3, 1/3)^T, μ2 = (1, 1)^T, w = (√2/2, √2/2)^T, μw(1) = w^Tμ1 = √2/3, μw(2) = w^Tμ2 = √2, θ1 = √2/2, θ2 = 2√2/3, and the 4 projected values are (0, √2/2, √2/2, √2). The FLD with θ2 is able to recognize all the 4 patterns,
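These values can be reproduced with a few lines of NumPy. The script below assumes θ1 is the sample-size-weighted mean of the projected class means and θ2 their unweighted midpoint, an interpretation that matches the numbers quoted above (the actual formulas are given in Section 3):

import numpy as np

X1 = np.array([[0., 0.], [1., 0.], [0., 1.]])    # class omega_1
X2 = np.array([[1., 1.]])                        # class omega_2

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)      # (1/3, 1/3) and (1, 1)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
w = np.linalg.solve(S_W, mu2 - mu1)              # points from omega_1 toward omega_2
w /= np.linalg.norm(w)                           # -> (sqrt(2)/2, sqrt(2)/2)

proj = np.vstack([X1, X2]) @ w                   # (0, sqrt2/2, sqrt2/2, sqrt2)
theta1 = (3 * w @ mu1 + 1 * w @ mu2) / 4         # size-weighted mean  = sqrt2/2 (assumed form)
theta2 = (w @ mu1 + w @ mu2) / 2                 # midpoint of means   = 2*sqrt2/3 (assumed form)
print(w, proj, theta1, theta2)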

Neighborhood-preserving feature representation and extraction

Without loss of generality, the following discussion is limited to DEC attributes.

Iterative learning strategy and integrated FLDs

Let us return to Example 1 in Section 3. After the point (a, a) is removed, the imbalance between the two classes ω1 and ω2 is markedly alleviated, and the weight vector w can be recalculated from only the samples whose projections fall within the interval [μw(1), μw(2)]. The samples outside the interval, regardless of whether they are correctly classified, no longer have an effect on the optimization of w. We can thus propose an iterative learning strategy for FLDs, called the iterative FLDs. The
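A schematic Python rendering of this strategy is given below; the interval test, the midpoint threshold and the stopping rule are simplified stand-ins based on the description above and the abstract's limit of three epochs, not the paper's exact algorithm:

import numpy as np

def iterative_fld(X1, X2, max_epochs=3):
    """Epoch-limited iterative FLD (sketch). After each analytic solution,
    only the samples whose projections fall inside [w^T mu1, w^T mu2] are
    kept to re-estimate w, and updating continues only while the training
    error rate decreases, for at most `max_epochs` epochs."""
    def solve(A, B):
        m1, m2 = A.mean(axis=0), B.mean(axis=0)
        S_W = (A - m1).T @ (A - m1) + (B - m2).T @ (B - m2)
        w = np.linalg.pinv(S_W) @ (m1 - m2)      # pseudo-inverse for robustness in this sketch
        return w, 0.5 * (w @ m1 + w @ m2), m1, m2

    def err(w, theta):                           # training error rate on the full set
        return (np.sum(X1 @ w <= theta) + np.sum(X2 @ w >= theta)) / (len(X1) + len(X2))

    w, theta, m1, m2 = solve(X1, X2)
    best = err(w, theta)
    for _ in range(max_epochs - 1):
        lo, hi = sorted((w @ m1, w @ m2))
        A = X1[(X1 @ w >= lo) & (X1 @ w <= hi)]  # samples projected into the interval
        B = X2[(X2 @ w >= lo) & (X2 @ w <= hi)]
        if len(A) == 0 or len(B) == 0:
            break
        w_new, theta_new, m1, m2 = solve(A, B)
        e = err(w_new, theta_new)
        if e >= best:                            # keep updating only while the error decreases
            break
        w, theta, best = w_new, theta_new, e
    return w, theta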

Shuttle dataset

The Shuttle dataset [42] contains 9 integer (INT) attributes (i.e., 9 dimensions) and 7 classes. Of the training samples, 78.41% belong to class ω1. The imbalance ratio (IR) of sample sizes between ω1 and ω6 reaches 34,108/6 = 5684.67; in other words, the maximum imbalance ratio of sizes is IRmax = 5684.67. Obviously, the imbalance of sizes is very serious.

Table 1 summarizes the statistical characteristics of the sample distributions in the training set. The standard deviations of attributes 2, 4 and 6 reach 78.14, 41.00 and 179.49,

Conclusions

On the basis of theoretical analysis and experimental results, we arrive at the following suggestions and conclusions:

  • (A)

    For very dense datasets, the class margins can be enlarged directly by proper coding modes, provided that the neighborhood relationships of the samples are approximately preserved. The mixed BIN–DEC coding system has these advantages.

  • (B)

    It is infeasible to add tiny perturbations to within-class scatter matrices SW that are singular or nearly singular. By means of

Conflict of interest

None declared.

Acknowledgments

This work is funded by the National Natural Science Foundation of China (NSFC) under Grant nos. 21176077, 61272198 and 60675027, the High-Tech Development Program of China (863) under Grant no. 2006AA10Z315, and the Open Funding Project of the State Key Laboratory of Bioreactor Engineering.


References (55)

  • A. Roy et al.

    An algorithm to generate radial basis function (RBF)-like nets for classification problems

    Neural Networks

    (1995)
  • D. Elizondo

    The linear separability problem: some testing methods

    IEEE Transactions on Neural Networks

    (2006)
  • T. Cooke

    Two variations on Fisher's linear discriminant for pattern recognition

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2002)
  • J.S. Koford et al.

    The use of an adaptive threshold element to design a linear optimal pattern classifier

    IEEE Transactions on Information Theory

    (1966)
  • R.P. Lippmann

    Neural network classifiers for speech recognition

    Lincoln Laboratory Journal

    (1988)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proceedings of the IEEE

    (1998)
  • S. Raudys et al.

    Pairwise costs in multiclass perceptrons

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2010)
  • Z. Fu et al.

    Mixing linear SVMs for nonlinear classification

    IEEE Transactions on Neural Networks

    (2010)
  • O. Pujol et al.

    Geometry-based ensembles: towards a structural characterization of the classification boundary

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2009)
  • D. You et al.

    Kernel optimization in discriminant analysis

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2011)
  • P. Chaudhuri et al.

    Classification based on hybridization of parametric and nonparametric classifiers

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2009)
  • Z. Nenadic

    Information discriminant analysis: feature extraction with an information-theoretic objective

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2007)
  • R.O. Duda et al.

    Pattern Classification

    (2000)
  • M. Loog et al.

    Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2004)
  • A.K. Jain et al.

    Statistical pattern recognition: a review

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2000)
  • O.C. Hamsici et al.

    Bayes optimality in linear discriminant analysis

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2008)
  • E. Kokiopoulou et al.

    Orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2007)
  • Cited by (14)

    • Regularized Fisher linear discriminant through two threshold variation strategies for imbalanced problems

      2018, Knowledge-Based Systems
      Citation Excerpt:

      Both ideas meet our requirement. In this paper, considering the characteristic of FLD and motivated by both classifiers mentioned above, we first regularize the original FLD in a way inspired by the Locality Preserving Projection (LPP) [32–34], and then boost the Regularized FLD (RFLD) by two strategies that are respectively modified from the Integrated FLD [5] and the BEPILD [31]. As a result, we design two classifiers for imbalanced problems and call them RFLD-S1 and RFLD-S2, respectively.

    • Pseudo-inverse linear discriminants for the improvement of overall classification accuracies

      2016, Neural Networks
      Citation Excerpt:

      We stress that this work is the development of our earlier work (Gao et al., 2014); therefore we will always pay much attention on the difference between PILDs and FLDs.


    Gao Daqi received the Ph.D. degree in Industrial Automation from Zhejiang University, China, in 1996. Currently, he is a Professor in the Department of Computer Science at East China University of Science and Technology (ECUST). He has authored or coauthored more than 90 papers. His research interests include Pattern Recognition, Machine Learning, Neural Networks and Artificial Olfaction.

    Ding Jun is currently a Ph.D. student in ECUST. His research interests are Machine Learning and Data Mining.

    Zhu Changming is currently a Ph.D. student in ECUST. His research interests are Pattern Recognition and Machine Learning.
