A theoretical comparison of two-class Fisher’s and heteroscedastic linear dimensionality reduction schemes
Introduction
Linear dimensionality reduction (LDR) techniques aim to reduce high-dimensional data to a lower dimension in such a way that the classification of the new data is more tractable and can still be done efficiently. We consider the traditional two-class case, and assume that the two classes, ω1 and ω2, are represented by two normally distributed n-dimensional random vectors, x1 ~ N(m1, S1) and x2 ~ N(m2, S2), whose a priori probabilities are p1 and p2, respectively. The aim is to linearly transform x1 and x2 into new normally distributed random vectors y1 = A x1 and y2 = A x2 of dimension d, with d < n, using a matrix A of order d × n, in such a way that the classification error in the transformed space is as small as possible.
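Since the classes are Gaussian and the mapping is linear, the transformed classes are again Gaussian with parameters that follow directly from linearity. A minimal sketch (assuming NumPy; the function name is ours):

```python
import numpy as np

def transform_params(A, m, S):
    """If x ~ N(m, S) and y = A x, then y ~ N(A m, A S A^T).

    A is the d x n reduction matrix, m an n-vector class mean,
    and S an n x n class covariance matrix.
    """
    return A @ m, A @ S @ A.T
```

The classification error in the reduced space is then evaluated on these transformed parameters.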
A typical approach to reducing the dimension of the data is principal component analysis (PCA) (Duda et al., 2000, Theodoridis and Koutroumbas, 2006, Webb, 2002), but it is better suited to unsupervised learning problems, since it treats the whole data set as a "single" class, losing the discriminative power of the labels. Supervised classification schemes that reduce to dimension one include the well-known Fisher's discriminant (FD) approach (Duda et al., 2000, Theodoridis and Koutroumbas, 2006) and direct Fisher's discriminant analysis (Gao and Davis, 2006). A heteroscedastic criterion for LDR is the one proposed in (Loog and Duin, 2004), which incorporates directed distance matrices into the objective function to capture the Chernoff distance in the original space. In contrast, a reduction method that maximizes the Chernoff distance in the transformed space has recently been proposed in (Rueda and Herrera, 2008).
We consider two LDR techniques, Fisher's discriminant (FD) and Loog–Duin (LD) dimensionality reduction, and theoretically analyze their common aspects. Let S_W = p1 S1 + p2 S2 and S_E = (m1 − m2)(m1 − m2)^T be the within-class and between-class scatter matrices, respectively. The FD criterion consists of maximizing the distance between the transformed distributions by finding the matrix A that maximizes the following function (Loog and Duin, 2004, Theodoridis and Koutroumbas, 2006):

J_FD(A) = tr{(A S_W A^T)^{-1} (A S_E A^T)}.  (1)

The matrix A that maximizes Eq. (1) is obtained by finding the eigenvalue decomposition of

S_FD = S_W^{-1} S_E,  (2)

whenever S_W is non-singular, and taking the d eigenvectors whose eigenvalues are the largest. Since the eigenvalue decomposition of the matrix (2) leads to only one non-zero eigenvalue, (m1 − m2)^T S_W^{-1} (m1 − m2), whose eigenvector is given by S_W^{-1} (m1 − m2), FD can only reduce to dimension d = 1.
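Because S_W^{-1} S_E has rank one, the FD solution can be computed directly from S_W^{-1}(m1 − m2), with no explicit eigendecomposition. A minimal sketch (assuming NumPy; the function name is ours):

```python
import numpy as np

def fisher_direction(m1, m2, S1, S2, p1, p2):
    """One-dimensional Fisher's discriminant direction.

    S_W = p1*S1 + p2*S2 is the within-class scatter matrix; the only
    eigenvector of S_W^{-1} S_E with a non-zero eigenvalue is
    proportional to S_W^{-1} (m1 - m2).
    """
    SW = p1 * S1 + p2 * S2
    a = np.linalg.solve(SW, m1 - m2)  # avoids forming SW^{-1} explicitly
    return a / np.linalg.norm(a)
```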
While the FD criterion is homoscedastic, as it takes the Mahalanobis distance between the means, the LD criterion uses the concept of directed distance matrices as a heteroscedastic component to linearly transform the data to a lower-dimensional space, generalizing Fisher's criterion. It achieves this by replacing the between-class scatter matrix with the corresponding directed distance matrix. The LD criterion consists of obtaining the matrix A that maximizes the function (Loog and Duin, 2002):

J_LD(A) = tr{(A S_W A^T)^{-1} A [S_E − (1/(p1 p2)) S_W^{1/2} (p1 log(S_W^{-1/2} S1 S_W^{-1/2}) + p2 log(S_W^{-1/2} S2 S_W^{-1/2})) S_W^{1/2}] A^T}.  (3)

The solution to this criterion is given by the matrix A that is composed of the d eigenvectors (whose eigenvalues are maximum) of the following matrix:

S_LD = S_W^{-1} [S_E − (1/(p1 p2)) S_W^{1/2} (p1 log(S_W^{-1/2} S1 S_W^{-1/2}) + p2 log(S_W^{-1/2} S2 S_W^{-1/2})) S_W^{1/2}].  (4)

In (Ali et al., 2006), it has been empirically shown that, when reducing to dimensions d = 1, …, n − 1, LD outperforms FD in many cases when coupling the LDR technique with a quadratic (Bayesian) classifier, namely the classifier that is optimal under the assumption of normally distributed classes. Even though this behavior is exhibited when reducing to higher dimensions, in (Ali et al., 2006) and in this paper it is empirically shown that LD also outperforms FD for d = 1. In this paper, we theoretically compare both LDR techniques and provide the necessary and sufficient conditions for their equivalence. We also show empirically that the theoretical analysis is related to the probability of error obtained by coupling the LDR techniques with the quadratic and linear classifiers in the one-dimensional space.
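The LD solution can be sketched numerically as follows, assuming the two-class Chernoff directed distance matrix of Loog and Duin, with S_W = p1 S1 + p2 S2 and S_E = (m1 − m2)(m1 − m2)^T; the function and helper names are ours, and the matrix logarithm and square root are taken through the symmetric eigendecomposition:

```python
import numpy as np

def _logm_spd(M):
    # matrix logarithm of a symmetric positive-definite matrix
    w, V = np.linalg.eigh(M)
    return (V * np.log(w)) @ V.T

def _sqrtm_spd(M):
    # matrix square root of a symmetric positive-definite matrix
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def ld_transform(m1, m2, S1, S2, p1, p2, d):
    """d x n transformation for the two-class LD (Chernoff) criterion."""
    SW = p1 * S1 + p2 * S2
    SE = np.outer(m1 - m2, m1 - m2)
    SW_half = _sqrtm_spd(SW)
    SW_ihalf = np.linalg.inv(SW_half)
    # heteroscedastic correction: vanishes when S1 == S2 (log of the identity)
    L = (p1 * _logm_spd(SW_ihalf @ S1 @ SW_ihalf)
         + p2 * _logm_spd(SW_ihalf @ S2 @ SW_ihalf))
    SC = SE - SW_half @ L @ SW_half / (p1 * p2)
    w, V = np.linalg.eig(np.linalg.solve(SW, SC))  # eigenvectors of SW^{-1} SC
    idx = np.argsort(w.real)[::-1][:d]             # d largest eigenvalues
    return V[:, idx].real.T
```

One easy sanity check: in the homoscedastic case S1 = S2, the correction term vanishes and the d = 1 solution reduces to Fisher's direction.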
Theoretical comparison
In this section, we derive the necessary and sufficient conditions under which the FD and LD criteria are maximized by the same linear transformation. While this is the main result proved in Theorem 1, the proof resorts to two main assumptions. One is that, for the FD criterion, there always exist diagonal covariance matrices for which an equivalent criterion leads to the same maximum; this is proved in Lemma 1. The other assumption is for the LD criterion, for which there also
Experimental results
The first set of experiments was performed to corroborate the validity of Conjecture 1. For this, various experiments were performed on randomly generated parameters for two 20-dimensional normal distributions, N(m1, S1) and N(m2, S2), and a priori probabilities p1 and p2. A total of 1000 different parameters were generated, and the optimal transformation was found for the FD and LD criteria, where the transformation is a d × 20 orthogonal matrix with d randomly generated between 1 and 19. The maximum for both
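Random parameters of this kind can be generated along the following lines (a sketch assuming NumPy; the sampling scheme and names are ours, not necessarily those used in the actual experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(n):
    """Random symmetric positive-definite covariance: Q diag(w) Q^T with w > 0."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    w = rng.uniform(0.5, 5.0, n)
    return (Q * w) @ Q.T

def random_orthogonal(n):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))  # sign fix for a uniform (Haar) distribution

# e.g. one trial: a random d x 20 transformation with 1 <= d <= 19
d = int(rng.integers(1, 20))
A = random_orthogonal(20)[:d, :]
```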
Conclusion
We have formalized the necessary and sufficient conditions under which two well-known LDR criteria, FD and LD, are maximized by the same linear transformation. To derive these conditions, we have first shown that the two criteria preserve the same maximum value after a diagonalization process is applied. For FD, we have found the linear transformation that allows us to obtain the LDR in the original space. For the LD criterion, we have conjectured that the maximum values coincide in both the
Acknowledgements
The authors would like to thank the Reviewers for their valuable comments and their dedication to improving the quality of this paper. This research has been supported by the Chilean National Council for Technological and Scientific Research, FONDECYT Grant No. 1060904, and the Institute of Informatics, National University of San Juan, Argentina.
References

- H. Gao, J.W. Davis, Why direct LDA is not equivalent to LDA, Pattern Recognition (2006)
- L. Rueda, M. Herrera, Linear dimensionality reduction by maximizing the Chernoff distance in the transformed space, Pattern Recognition (2008)
- M.L. Ali et al., On the performance of Chernoff distance-based linear dimensionality reduction techniques
- R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (2000)
- D.A. Harville, Matrix Algebra From a Statistician's Perspective (1997)