Integrated Fisher linear discriminants: An empirical study
Introduction
Linear classifiers are among the most basic pattern recognition models; they mainly include classical Fisher linear discriminants (FLDs) [1], [2], [3], [4], [5], single-layer perceptrons (SLPs) [6], [7], [8] and linear support vector machines (SVMs) [9], [10], [11]. Because of their simple structures and low computational costs, FLDs are among the most popular linear classifiers in many applications [12], [13], [14]. The key to the success of FLDs seems to lie in the fact that the linear decision hyperplanes they produce usually offer reasonable partitions. However, a major issue with FLDs is that their parameters, namely the mean vectors and within-class scatter matrices, have to be estimated from the limited available training samples in order to determine the weight vectors and thresholds [5], [15], [16]. Although the parameter estimates associated with Gaussian distributions have some good statistical properties, the Gaussian assumption does not suit all distributions [17], [18].
The criterion function J(.), which maximizes the ratio of the between-class scatter to the within-class scatter in order to find a proper weight vector, is well known as Fisher's criterion [15]. To date, many criterion functions have been presented [11], [16], [17], [19]. In a sense, the diverse quadratic optimization functions used in SVMs are a natural generalization of Fisher's criterion [20], [21]. In fact, FLDs are consistent with SVMs in their aim of maximizing the class margins by constructing optimal separating hyperplanes or projection directions [22].
Imbalance is absolute, while balance is relative. Currently, much attention is being paid to imbalanced datasets [23], [24], [25], [26], [27], [28], most of it centered on the imbalance of sample sizes for the sake of intuitiveness and simplicity. A classifier trained on an imbalanced dataset tends to assign samples to the majority class [29]. In other words, learning is prone to produce a classifier with a strong bias toward the majority class, resulting in a large number of false negatives.
Re-sampling techniques are popular for solving imbalanced problems: either the minority classes are over-sampled, or the majority classes are under-sampled, or some combination of the two is employed [25], [28], [30], [31], [32]. Meanwhile, boosting and bagging algorithms have been developed that successively train component classifiers, known as cost-sensitive classifiers [10], [33]. They work by assigning larger weights to mislabeled samples and smaller weights otherwise. AdaBoost, a variant of boosting, is the most popular of these [34]. Such algorithms can handle imbalanced cases to some extent.
Is an FLD able to solve a linearly separable problem? The answer is “Yes” for a “balanced” dataset. However, when a two-class dataset is seriously imbalanced, such an FLD may fail: the more serious the imbalance, the poorer the resulting FLD performs. The imbalance of distribution regions usually has a far greater influence on classifier performance than the imbalance of sample sizes does [35]. Therefore, it is more consistent with actual situations to consider the imbalance of distribution regions, e.g., variances [5], [10], [16], [18]. Indeed, scatters, as manifestations of variances, are already included in J(.). However, only the sum of the two within-class scatter matrices is used, as the denominator of J(.); in other words, J(.) reflects neither the difference between the within-class scatters nor the imbalance of the distribution regions.
It is the threshold of an FLD that finally determines the location of the separating hyperplane. Based on the above observations, we first have to answer the question: “Can the classification accuracy of an FLD be improved by selecting a proper threshold?”
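As a minimal illustration of why the threshold matters, consider two classes whose projected values have equal sample sizes but very different spreads. The sketch below is synthetic, and the variance-weighted threshold is one simple heuristic for illustration, not one of the formulas proposed in Section 3:

```python
import numpy as np

rng = np.random.default_rng(0)
# Projected values of two classes: equal sizes, very unequal spreads.
y1 = rng.normal(0.0, 1.0, 5000)   # class w1, tight
y2 = rng.normal(6.0, 3.0, 5000)   # class w2, wide

def accuracy(theta):
    # Decision rule: projection < theta -> w1, otherwise w2.
    return ((y1 < theta).sum() + (y2 >= theta).sum()) / 10000

theta_mid = (y1.mean() + y2.mean()) / 2               # classical midpoint
# A variance-weighted threshold (hypothetical, for illustration only):
s1, s2 = y1.std(), y2.std()
theta_var = (s2 * y1.mean() + s1 * y2.mean()) / (s1 + s2)

acc_mid, acc_var = accuracy(theta_mid), accuracy(theta_var)
# The variance-weighted threshold sits closer to the tight class and
# misclassifies noticeably fewer samples overall.
```

Here the midpoint threshold ignores the imbalance of the two spreads, so the wide class spills far across it; shifting the threshold toward the tight class trades a few cheap errors for many avoided ones.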
The weight vector w of an FLD for solving a two-class problem {ω1, ω2} is determined only by the within-class scatter matrix SW and the two mean vectors μ1 and μ2. If SW is singular or nearly singular, the FLD no longer works. To address this issue, a tiny perturbation matrix is often added to SW [1], [12], [34], [36]. However, it is well known that (A+B)−1 ≠ A−1+B−1 in general. Therefore, we need to answer the second question: “Is it feasible to add a tiny perturbation to a nearly singular matrix SW?”
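The sensitivity behind this question is easy to demonstrate numerically. In the sketch below (the scatter matrix and the difference vector μ1−μ2 are hypothetical), the solution of SW·w = μ1−μ2 changes by orders of magnitude as the perturbation size varies, so the “tiny” perturbation effectively decides the answer:

```python
import numpy as np

# A nearly singular within-class scatter matrix (hypothetical example).
SW = np.array([[1.0, 1.0],
               [1.0, 1.0 + 1e-8]])
d = np.array([1.0, -1.0])               # hypothetical mu1 - mu2

norms = []
for eps in (0.0, 1e-6, 1e-4, 1e-2):
    w = np.linalg.solve(SW + eps * np.eye(2), d)
    norms.append(np.linalg.norm(w))
# ||w|| shrinks by orders of magnitude as eps grows: the "tiny" perturbation
# dominates the solution, and (SW + eps*I)^(-1) is nothing like SW^(-1).
```

Because the smallest eigenvalue of SW is of the same order as the perturbation, the perturbed inverse bears no resemblance to the exact one, which is exactly why simply adding a small matrix is questionable.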
During learning, large attribute values usually have a greater influence on the parameters of classifiers than small ones do; conversely, during decision making, small class margins usually have a greater influence on the generalization performance of classifiers than large ones do [17], [22], [35]. Unduly large elements in a small subset of the attributes may make SW close to singular, and unduly small margins increase the difficulty of seeking the optimal separating hyperplanes. Both cases degrade the generalization performance of the resulting classifiers. In a sense, feature representation is a crucial step in designing classifiers, whether linear or non-linear.
Decimal (DEC) and binary (BIN) codes are two common feature representation systems. Normalized, proportional, logarithmic and sigmoid transformations are several popular equal-dimensional ones. SVMs enlarge the class margins by making the original data sparse in higher-dimensional feature spaces through nonlinear transformations chosen a priori, e.g., polynomial and radial basis function (RBF) kernels [20], [21]. Borrowing this idea, we can transform data from a lower-dimensional input space into a higher-dimensional feature space directly, by coding in advance. The premise for doing so is that the original information must be preserved as much as possible [19]. Therefore, the third question to be answered is: “How can an effective feature representation system be developed that enlarges the class margins as much as possible, lessens the within-class scatters and eases the unfavorably large components, on condition that the neighborhood relationships are approximately preserved?”
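To make the idea of dimension-expanding codes concrete, here is one plausible sketch (the actual mixed BIN–DEC scheme of Section 4 may differ): each integer attribute is expanded into its binary digits, and the scaled decimal value is kept alongside them so that neighborhood relationships are at least approximately preserved.

```python
import numpy as np

def bin_dec_encode(X, n_bits=8):
    # Expand each integer attribute into n_bits binary digits and append
    # the decimal value scaled into [0, 1] (a hypothetical mixed BIN-DEC code).
    X = np.asarray(X, dtype=int)
    bits = (X[:, :, None] >> np.arange(n_bits - 1, -1, -1)) & 1
    bits = bits.reshape(X.shape[0], -1).astype(float)
    dec = X / (2 ** n_bits - 1)
    return np.hstack([bits, dec])

X = np.array([[3, 200],
              [4, 199]])                 # two neighbouring patterns
Z = bin_dec_encode(X)                    # shape (2, 2*8 + 2) = (2, 18)
```

Note that a pure binary code alone can distort neighborhoods (3 = 00000011 and 4 = 00000100 differ in three bits), which is why the decimal part is retained in this sketch.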
The solution of the weight vectors in FLDs is in essence an analytic learning process, which is often faster than the iterative ones used in neural networks and SVMs [15], [20]. Although it is the customary practice, the one-time analytic solution of the weights is not necessarily optimal. Therefore, the fourth question to be answered is: “How can an iterative learning algorithm be developed that alleviates the imbalance and accordingly updates the weights and thresholds by properly selecting a portion of the training samples?”
This paper aims to noticeably improve the classification accuracy of FLDs by answering the above questions. In other words, from a heuristic point of view, this paper seeks to empirically optimize the weights and thresholds of FLDs so as to achieve minimum-error-rate classification in the sense of Bayesian decision theory. The rest of this paper is organized as follows. Section 2 introduces related work on FLDs. In Section 3, a series of empirical threshold formulas is proposed to alleviate the imbalance. Section 4 describes some mixed feature representation approaches and the condition for carrying out feature extraction by principal component analysis (PCA). Section 5 details the epoch-limited iterative learning strategy for further alleviating the imbalance. Section 6 presents the experimental results. Finally, Section 7 draws our conclusions.
Section snippets
Related work
First of all, let us consider a two-class classification problem {ω1, ω2} and the linear discriminant function. For a pattern x=(x1, x2, …, xm)T∈Rm in the m-dimensional input space, the decision hyperplane π can be written as wTx+w0=0, where w=(w1, w2, …, wm)T∈Rm is the weight vector, often called the normal or projection direction of π, and θ=−w0 is the threshold or bias.
The decision rule is: assign x to ω1 if wTx>θ, and to ω2 otherwise.
Suppose the two input data matrices are
Empirical thresholds
First of all, let us discuss a simple example—Example 1.
A synthetic dataset X consists of only four samples {(0, 0); (1, 0); (0, 1); (1, 1)}, the first 3 of which belong to class ω1, and the remaining one to class ω2. Obviously, the dataset is linearly separable. The mean vectors, within-class scatter matrix, weight vector and the 4 projected values can be computed directly from these samples. The FLD with the threshold θ2 is able to recognize all the 4 patterns,
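The inline values of this example were lost in the excerpt; they can be recomputed from the four samples with the classical formulas (the proposed threshold θ2 itself is not reproduced here):

```python
import numpy as np

# Example 1: three samples in class w1, one in class w2.
X1 = np.array([[0, 0], [1, 0], [0, 1]], dtype=float)
X2 = np.array([[1, 1]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)      # (1/3, 1/3) and (1, 1)
SW = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

w = np.linalg.solve(SW, mu1 - mu2)               # classical Fisher direction
proj = np.vstack([X1, X2]) @ w                   # the 4 projected values
# w = (-2, -2) and proj = [0, -2, -2, -4]: any threshold strictly between
# -4 and -2 separates the two classes perfectly.
```

With the rule wTx>θ → ω1, the single ω2 sample projects well below the three ω1 samples, so the location of the threshold within (−4, −2) alone decides whether the imbalanced class is recognized.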
Neighborhood-preserving feature representation and extraction
Without loss of generality, the following discussion is limited to DEC attributes.
Iterative learning strategy and integrated FLDs
Let us consider Example 1 in Section 3 again. After the point (a, a) is removed, the imbalance between the two classes ω1 and ω2 is manifestly alleviated, and the weight vector w can be recalculated using only the samples within the interval. The samples outside the interval, regardless of whether they are correctly classified or not, no longer have an effect on optimizing w. We can thus propose an iterative learning strategy for FLDs, called the iterative FLDs. The
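The following is a minimal sketch of such an epoch-limited iterative FLD; the band width, the midpoint threshold and the stopping rule are assumptions for illustration, not the paper's exact algorithm:

```python
import numpy as np

def iterative_fld(X1, X2, epochs=5, band=1.0):
    # Re-solve the FLD on the samples whose projections fall near the current
    # threshold; distant samples no longer influence w (sketch, see text).
    A1, A2 = X1, X2
    for _ in range(epochs):
        mu1, mu2 = A1.mean(axis=0), A2.mean(axis=0)
        SW = (A1 - mu1).T @ (A1 - mu1) + (A2 - mu2).T @ (A2 - mu2)
        w = np.linalg.solve(SW + 1e-9 * np.eye(SW.shape[0]), mu1 - mu2)
        theta = w @ (mu1 + mu2) / 2                 # midpoint threshold
        keep1 = np.abs(X1 @ w - theta) < band       # samples near the boundary
        keep2 = np.abs(X2 @ w - theta) < band
        if keep1.sum() < 2 or keep2.sum() < 2:      # too few samples remain
            break
        A1, A2 = X1[keep1], X2[keep2]
    return w, theta
```

On a linearly separable synthetic dataset, the rule wTx>θ → ω1 classifies (nearly) all samples correctly, while later epochs fit w only from the region around the boundary.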
Shuttle dataset
The Shuttle dataset [42] contains 9 integer (INT, 9D) attributes and 7 classes. 78.41% of the training samples belong to class ω1. The imbalance ratio (IR) of sample sizes between ω1 and ω6 even reaches 34,108/6=5684.67; in other words, the maximum imbalance ratio of sizes is IRmax=5684.67. Obviously, the imbalance of sizes is very serious.
Table 1 summarizes the statistical characteristics of the sample distributions in the training set. The standard deviations of attributes 2, 4 and 6 come up to 78.14, 41.00 and 179.49,
Conclusions
On the basis of theoretical analysis and experimental results, we came to the following suggestions and conclusions:
- (A)
For very dense datasets, the class margins can be enlarged directly by proper coding modes, on condition that the neighborhood relationships of the samples are approximately preserved. The mixed BIN–DEC coding system has such advantages.
- (B)
It is infeasible to add tiny perturbations to within-class scatter matrices SW that are singular or nearly singular. By means of
Conflict of interest
None declared.
Acknowledgments
This work is funded by the National Natural Science Foundation of China (NSFC) under Grant nos. 21176077, 61272198 and 60675027, the High-Tech Development Program of China (863) under Grant no. 2006AA10Z315, and the Open Funding Project of the State Key Laboratory of Bioreactor Engineering.
Gao Daqi received the Ph.D. degree in Industrial Automation from Zhejiang University, China, in 1996. Currently, he is a Professor in the Department of Computer Science at East China University of Science and Technology (ECUST). He has authored or coauthored more than 90 papers. His research interests include Pattern Recognition, Machine Learning, Neural Networks and Artificial Olfaction.
References (55)
- et al., Nonlinear Fisher discriminant analysis using a minimum squared error cost function and the orthogonal least squares algorithm, Neural Networks (2002)
- et al., Novel Fisher discriminant classifiers, Pattern Recognition (2012)
- et al., Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers, Pattern Recognition (2003)
- et al., Do unbalanced data have a negative effect on LDA?, Pattern Recognition (2008)
- et al., Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition (2007)
- et al., Inverse random under sampling for class imbalance problem and its application to multi-label classification, Pattern Recognition (2012)
- et al., A balanced neural tree for pattern classification, Neural Networks (2012)
- The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition (1997)
- et al., Dimensionality and sample size considerations in pattern recognition practice
- et al., Multi-class pattern classification using neural networks, Pattern Recognition (2007)
- An algorithm to generate radial basis function (RBF)-like nets for classification problems, Neural Networks
- The linear separability problem: some testing methods, IEEE Transactions on Neural Networks
- Two variations on Fisher's linear discriminant for pattern recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence
- The use of an adaptive threshold element to design a linear optimal pattern classifier, IEEE Transactions on Information Theory
- Neural network classifiers for speech recognition, Lincoln Laboratory Journal
- Gradient-based learning applied to document recognition, Proceedings of the IEEE
- Pairwise costs in multiclass perceptrons, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Mixing linear SVMs for nonlinear classification, IEEE Transactions on Neural Networks
- Geometry-based ensembles: towards a structural characterization of the classification boundary, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Kernel optimization in discriminant analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Classification based on hybridization of parametric and nonparametric classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Information discriminant analysis: feature extraction with an information-theoretic objective, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Pattern Classification
- Linear dimensionality reduction via a heteroscedastic extension of LDA: the Chernoff criterion, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Bayes optimality in linear discriminant analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique, IEEE Transactions on Pattern Analysis and Machine Intelligence
Ding Jun is currently a Ph.D. student in ECUST. His research interests are Machine Learning and Data Mining.
Zhu Changming is currently a Ph.D. student in ECUST. His research interests are Pattern Recognition and Machine Learning.