Pattern Recognition Letters

Volume 29, Issue 16, 1 December 2008, Pages 2092-2098

A theoretical comparison of two-class Fisher’s and heteroscedastic linear dimensionality reduction schemes

Communicated by R.P.W. Duin
https://doi.org/10.1016/j.patrec.2008.07.005

Abstract

We present a theoretical analysis for comparing two linear dimensionality reduction (LDR) techniques for two classes, a homoscedastic LDR scheme, Fisher’s discriminant (FD), and a heteroscedastic LDR scheme, Loog–Duin (LD). We formalize the necessary and sufficient conditions for which the FD and LD criteria are maximized for the same linear transformation. To derive these conditions, we first show that the two criteria preserve the same maximum values after a diagonalization process is applied. We derive the necessary and sufficient conditions for various cases, including coincident covariance matrices, coincident prior probabilities, and the case in which one of the covariance matrices is the identity. We empirically show that the conditions are statistically related to the classification error for a post-processing one-dimensional quadratic classifier and the Chernoff distance in the transformed space.

Introduction

Linear dimensionality reduction (LDR) techniques aim to reduce high-dimensional data to a lower dimension in such a way that the classification of the new data is more tractable and can still be done efficiently. We consider the traditional two-class case, and assume that the two classes, ω1 and ω2, are represented by two normally distributed n-dimensional random vectors, x1 ∼ N(m1, S1) and x2 ∼ N(m2, S2), with a priori probabilities p1 and p2, respectively. The aim is to linearly transform x1 and x2 into new normally distributed random vectors y1 and y2 of dimension d, d < n, using a matrix A of order d×n, in such a way that the classification error in the transformed space is as small as possible.
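As an illustration of this setup, the following minimal sketch uses hypothetical, randomly generated parameters (not data from the paper) to build two n-dimensional normal classes and apply a d×n transformation A, relying on the fact that y = Ax ∼ N(Am, ASA^t) when x ∼ N(m, S):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 1                                   # original and reduced dimensions (illustrative)

# Hypothetical class parameters: means m1, m2, covariances S1, S2, priors p1, p2.
m1, m2 = rng.normal(size=n), rng.normal(size=n)
B1, B2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S1, S2 = B1 @ B1.T + n * np.eye(n), B2 @ B2.T + n * np.eye(n)
p1, p2 = 0.4, 0.6

# Any d x n matrix A maps x ~ N(m, S) to y = A x ~ N(A m, A S A^t).
A = rng.normal(size=(d, n))
mean1_y, cov1_y = A @ m1, A @ S1 @ A.T
mean2_y, cov2_y = A @ m2, A @ S2 @ A.T
```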

A typical approach to reducing the dimension of the data is principal component analysis (PCA) (Duda et al., 2000, Theodoridis and Koutroumbas, 2006, Webb, 2002), but it is better suited to unsupervised learning problems, since it treats the whole data set as a single class and thus discards the discriminative information carried by the labels. Supervised classification schemes that reduce to dimension one include the well-known Fisher’s discriminant (FD) approach (Duda et al., 2000, Theodoridis and Koutroumbas, 2006) and direct Fisher’s discriminant analysis (Gao and Davis, 2006). A heteroscedastic criterion for LDR is the one proposed in (Loog and Duin, 2004), which uses directed distance matrices in the objective function to capture the Chernoff distance in the original space. In contrast, a reduction method that maximizes the Chernoff distance in the transformed space has recently been proposed in (Rueda and Herrera, 2008).

We consider two LDR techniques, Fisher’s discriminant (FD) and Loog–Duin (LD) dimensionality reduction, and theoretically analyze their common aspects. Let $S_W = p_1 S_1 + p_2 S_2$ and $S_E = (m_1 - m_2)(m_1 - m_2)^t$ be the within-class and between-class scatter matrices, respectively. The FD criterion consists of maximizing the distance between the transformed distributions by finding the matrix $A$ that maximizes the following function (Loog and Duin, 2004, Theodoridis and Koutroumbas, 2006):

$$J_{FD}(A) = \mathrm{tr}\left\{(A S_W A^t)^{-1}(A S_E A^t)\right\}. \quad (1)$$

The matrix $A$ that maximizes Eq. (1) is obtained by computing the eigenvalue decomposition of

$$S_{FD} = S_W^{-1} S_E, \quad (2)$$

whenever $S_W$ is non-singular, and taking the $d$ eigenvectors whose eigenvalues are the largest. Since the eigenvalue decomposition of the matrix (2) leads to only one non-zero eigenvalue, $(m_1 - m_2)^t S_W^{-1} (m_1 - m_2)$, whose eigenvector is $S_W^{-1}(m_1 - m_2)$, FD can only reduce to dimension $d = 1$.
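Under these definitions, Fisher’s direction can be computed directly. The sketch below is a straightforward implementation (not code from the paper) that returns the normalized direction $S_W^{-1}(m_1 - m_2)$ together with the rank-one matrix $S_{FD}$ of Eq. (2):

```python
import numpy as np

def fisher_direction(m1, m2, S1, S2, p1, p2):
    """Two-class Fisher's discriminant (d = 1): the only informative
    eigenvector of S_FD = S_W^{-1} S_E is proportional to S_W^{-1}(m1 - m2)."""
    SW = p1 * S1 + p2 * S2                    # within-class scatter
    SE = np.outer(m1 - m2, m1 - m2)           # between-class scatter (rank one)
    a = np.linalg.solve(SW, m1 - m2)          # S_W^{-1}(m1 - m2), up to scale
    S_FD = np.linalg.solve(SW, SE)            # the matrix of Eq. (2)
    return a / np.linalg.norm(a), S_FD

# e.g. a, S_FD = fisher_direction(m1, m2, S1, S2, p1, p2) with the parameters above
```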

While the FD criterion is homoscedastic, as it relies on the Mahalanobis distance between the means, the LD criterion introduces directed distance matrices as a heteroscedastic component, generalizing Fisher’s criterion for the linear transformation of the data to a lower-dimensional space. It achieves this by replacing the between-class scatter matrix with the corresponding directed distance matrix. The LD criterion consists of obtaining the matrix $A$ that maximizes the function (Loog and Duin, 2002):

$$J_{LD}(A) = \mathrm{tr}\left\{(A S_W A^t)^{-1}\left[A S_E A^t - A S_W^{1/2}\,\frac{p_1 \log\!\big(S_W^{-1/2} S_1 S_W^{-1/2}\big) + p_2 \log\!\big(S_W^{-1/2} S_2 S_W^{-1/2}\big)}{p_1 p_2}\,S_W^{1/2} A^t\right]\right\}. \quad (3)$$

The solution to this criterion is given by the matrix $A$ composed of the $d$ eigenvectors (those with the largest eigenvalues) of the following matrix:

$$S_{LD} = S_W^{-1} S_E - S_W^{-1/2}\,\frac{p_1 \log\!\big(S_W^{-1/2} S_1 S_W^{-1/2}\big) + p_2 \log\!\big(S_W^{-1/2} S_2 S_W^{-1/2}\big)}{p_1 p_2}\,S_W^{1/2}. \quad (4)$$

In (Ali et al., 2006), it has been empirically shown that, when reducing to dimensions $d > 1$, LD outperforms FD in many cases when the LDR technique is coupled with a quadratic (Bayesian) classifier, namely the optimal classifier under the assumption of normally distributed classes. Although that behavior was exhibited for the case $d > 1$, in (Ali et al., 2006) and in this paper it is also empirically shown that LD outperforms FD for $d = 1$. In this paper, we theoretically compare both LDR techniques and provide the necessary and sufficient conditions for their equivalence. We also show empirically that the theoretical analysis is related to the probability of error obtained by coupling the LDR technique with the quadratic and linear classifiers in the one-dimensional space.
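A direct way to form $S_{LD}$ of Eq. (4) is via the matrix square root and matrix logarithm. The following sketch (assuming a non-singular $S_W$ and using SciPy’s logm and fractional_matrix_power; not the authors’ code) builds $S_{LD}$ and keeps its $d$ leading eigenvectors as the transformation $A$:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def loog_duin_transform(m1, m2, S1, S2, p1, p2, d):
    """Sketch of LD dimensionality reduction: build S_LD as in Eq. (4) and
    keep the d eigenvectors with the largest eigenvalues (S_W assumed
    non-singular; real parts taken to suppress numerical round-off)."""
    SW = p1 * S1 + p2 * S2
    SE = np.outer(m1 - m2, m1 - m2)
    SW_half = np.real(fractional_matrix_power(SW, 0.5))
    SW_nhalf = np.real(fractional_matrix_power(SW, -0.5))
    log_term = (p1 * np.real(logm(SW_nhalf @ S1 @ SW_nhalf)) +
                p2 * np.real(logm(SW_nhalf @ S2 @ SW_nhalf))) / (p1 * p2)
    S_LD = np.linalg.solve(SW, SE) - SW_nhalf @ log_term @ SW_half
    eigvals, eigvecs = np.linalg.eig(S_LD)     # S_LD is not symmetric in general
    order = np.argsort(np.real(eigvals))[::-1]
    return np.real(eigvecs[:, order[:d]]).T    # d x n transformation matrix A
```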

Section snippets

Theoretical comparison

In this section, we derive the necessary and sufficient conditions for which the FD and LD criteria are maximized for the same linear transformation. While this is the main result proved in Theorem 1, that proof resorts to two main assumptions. One is that, for the JFD(A) criterion, there always exist diagonal covariance matrices for which an equivalent criterion leads to the same maximum. This is proved in Lemma 1. The other assumption is for the JLD(A) criterion, for which there also
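The invariance of the criterion value under such a diagonalizing change of coordinates can be checked numerically. The sketch below is only an illustration of the idea under randomly generated parameters (not the proof of Lemma 1): it simultaneously diagonalizes S1 and S2 and verifies that J_FD takes the same value when A is re-expressed in the new coordinates.

```python
import numpy as np

def j_fd(A, SW, SE):
    """Fisher criterion J_FD(A) = tr{(A S_W A^t)^{-1} (A S_E A^t)} of Eq. (1)."""
    return np.trace(np.linalg.solve(A @ SW @ A.T, A @ SE @ A.T))

rng = np.random.default_rng(1)
n, d = 6, 1
B1, B2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
S1, S2 = B1 @ B1.T + n * np.eye(n), B2 @ B2.T + n * np.eye(n)
m1, m2 = rng.normal(size=n), rng.normal(size=n)
p1, p2 = 0.3, 0.7
SW, SE = p1 * S1 + p2 * S2, np.outer(m1 - m2, m1 - m2)

# Simultaneous diagonalization: Phi^t S1 Phi = I and Phi^t S2 Phi is diagonal.
vals1, U1 = np.linalg.eigh(S1)
S1_nhalf = U1 @ np.diag(vals1 ** -0.5) @ U1.T
_, U2 = np.linalg.eigh(S1_nhalf @ S2 @ S1_nhalf)
Phi = S1_nhalf @ U2

SW_diag = p1 * (Phi.T @ S1 @ Phi) + p2 * (Phi.T @ S2 @ Phi)
SE_diag = Phi.T @ SE @ Phi

# J_FD is unchanged when A is mapped to A Phi^{-t} in the diagonalized coordinates.
A = rng.normal(size=(d, n))
A_diag = A @ np.linalg.inv(Phi).T
print(np.isclose(j_fd(A, SW, SE), j_fd(A_diag, SW_diag, SE_diag)))   # True
```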

Experimental results

The first set of experiments was performed to corroborate the validity of Conjecture 1. For this, various experiments were performed on randomly generated parameters for two 20-dimensional normal distributions, x1 ∼ N(m1, S1) and x2 ∼ N(m2, S2), and a priori probabilities p1 and p2. A total of 1000 different parameter sets were generated, and the optimal transformation was found for JLD(A) and its diagonalized counterpart, where A is a d×20 orthogonal matrix with d randomly generated between 1 and 19. The maximum for both
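A scaffold for this kind of Monte Carlo check might look as follows. This is a hypothetical reconstruction: the paper does not specify how its random parameters were drawn, and the optimization of the criteria and the comparison of the two maxima (the actual test of Conjecture 1) are omitted here.

```python
import numpy as np

def random_spd(n, rng):
    # Hypothetical generator of a random covariance matrix; the paper does
    # not state how its 1000 parameter sets were produced.
    B = rng.normal(size=(n, n))
    return B @ B.T + n * np.eye(n)

rng = np.random.default_rng(2)
n = 20
for trial in range(1000):
    S1, S2 = random_spd(n, rng), random_spd(n, rng)
    m1, m2 = rng.normal(size=n), rng.normal(size=n)
    p1 = rng.uniform(0.05, 0.95)
    p2 = 1.0 - p1
    d = int(rng.integers(1, n))               # reduced dimension in [1, 19]
    # A d x 20 matrix with orthonormal rows, as in the experimental setup;
    # maximizing J_LD and its diagonalized counterpart and comparing the
    # two maxima is left to the procedure described in the paper.
    Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
    A = Q.T
```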

Conclusion

We have formalized the necessary and sufficient conditions for which two well-known LDR criteria, FD and LD, are maximized for the same linear transformation. To derive these conditions, we have first shown that the two criteria preserve the same maximum value after a diagonalization process is applied. For FD, we have found the linear transformation that allows the LDR to be obtained in the original space. For the LD criterion, we have conjectured that the maximum values coincide in both the

Acknowledgements

The authors would like to thank the reviewers for their valuable comments and their dedication to improving the quality of this paper. This research has been supported by the Chilean National Council for Technological and Scientific Research, FONDECYT Grant No. 1060904, and the Institute of Informatics, National University of San Juan, Argentina.
