1 Introduction

The goal of data transformation techniques is to transform a large set of features into a compact set of informative features. Data transformation techniques are either unsupervised or supervised. Unsupervised techniques aim to preserve some significant characteristics of the data in the transformed space. The most commonly used unsupervised technique is principal component analysis (PCA) [4]. It projects a given d-dimensional feature vector onto the eigenvectors of the data covariance matrix corresponding to the l most significant eigenvalues of the matrix. Kernel PCA [11] performs PCA in the kernel feature space of a Mercer kernel. Unsupervised techniques do not make use of class labels, and hence the transformed representation may not be discriminative. Supervised data transformation techniques also use the class label information. The commonly used supervised techniques are Fisher discriminant analysis (FDA) [3], multiple discriminant analysis (MDA) [2] and their many variants. FDA finds a direction for projection along which the separability of the projections of the data belonging to two classes is maximum. MDA is a multi-class extension of FDA. Kernel FDA [7] performs FDA in the kernel feature space. Generalized discriminant analysis (GDA) [1] is an extension of kernel FDA to multiple classes. A major limitation of these supervised techniques is that the dimension of the transformed data is limited by the number of classes.

In the kernel entropy component analysis (kernel ECA) [5, 6] technique, the eigenvectors of the kernel gram matrix used for projection are chosen based on their contribution to the Renyi quadratic entropy of the input data. Kernel ECA is an unsupervised technique.

In this paper, we develop a new discriminative transformation method that can be considered an extension of kernel ECA to supervised dimension reduction. It chooses the directions for projection that maximally preserve the Euclidean divergence [10] between the probability density functions of the two classes. An estimator of the Euclidean divergence is expressed in terms of the eigenvectors and eigenvalues of the kernel gram matrix of the Gaussian kernel used in the Parzen window method [9] for density estimation. The directions for projection are obtained from the eigenvalues and eigenvectors that contribute significantly to the divergence estimate.

The paper is organized as follows. In Sect. 2, we present the kernel ECA method. We discuss the proposed kernel EDA method in Sect. 3. Experimental studies and results are presented in Sect. 4, and Sect. 5 concludes the paper.

2 Kernel Entropy Component Analysis

Kernel entropy component analysis (kernel ECA) focuses on entropy components, instead of the principal components that represent variance in kernel PCA. The Renyi quadratic entropy of a distribution \(p(\mathbf x )\) is given by

$$\begin{aligned} H(p(\mathbf x ))=-\log \int p^2(\mathbf x )\,d\mathbf x \end{aligned}$$
(1)

The information potential of a distribution \(p(\mathbf x )\) is defined as \(V(p)= \int p^2(\mathbf x )\,d\mathbf x \). The information potential can also be expressed as \(V(p)=\mathcal {E}_p(p)\), where \(\mathcal {E}_p(\cdot )\) denotes expectation with respect to \(p(\mathbf x )\). Consider the data set \(D=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N \}\). Let \(k_\sigma (\mathbf x _m,\cdot )\) be the Gaussian kernel with width \(\sigma \) used in the Parzen window method for estimating the density at \(\mathbf x _m\). It may be noted that the Gaussian kernel is a Mercer kernel and, therefore, a positive semi-definite kernel. Then the estimate of the density is given by

$$\begin{aligned} \hat{p}(\mathbf x _m) = \frac{1}{N} \sum \limits _\mathbf{x _n \in D} k_{\sigma } (\mathbf x _m,\mathbf x _n) \end{aligned}$$
(2)

Then the estimate of \(V(p(\mathbf x ))\) denoted by \(\hat{V}(p)\) is given by

$$\begin{aligned} \hat{V}(p)= \frac{1}{N}\sum \limits _{\mathbf x _m\in D}\hat{p}(\mathbf x _m) = \frac{1}{N}\sum \limits _{\mathbf x _m\in D}\frac{1}{N}\sum \limits _{\mathbf x _n\in D} k_\sigma (\mathbf x _m,\mathbf x _n)=\frac{1}{N^{2}} \bar{\mathbf{1}}^{T}\mathbf K \bar{\mathbf{1}} \end{aligned}$$
(3)

where \(\bar{\mathbf{1}}\) is an \(N\times 1\) vector of all 1's and \(\mathbf K \) is the kernel gram matrix of the kernel \(k_\sigma (\cdot ,\cdot )\) on the dataset D. Using the eigen decomposition of the kernel gram matrix \(\mathbf K \), Eq. (3) can be rewritten as

$$\begin{aligned} \hat{V}(p)= \frac{1}{N^2}\sum \limits _{i=1}^{N}(\sqrt{\lambda _{i}}\,\mathbf e _{i}^{T}\bar{\mathbf{1}})^{2} \end{aligned}$$
(4)

where \(\lambda _i\) and \(\mathbf {e}_i\) are the eigenvalues and eigenvectors of \(\mathbf K \), respectively. The eigenvectors of \(\mathbf K \) used for projection are identified based on the extent of their contribution to the information potential estimate. Note that the kernel ECA method does not make use of the class labels of the examples in D.
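
For concreteness, this selection step can be sketched as follows. This is a minimal NumPy sketch, assuming a Gaussian kernel computed directly from the data; the names (`gaussian_gram`, `kernel_eca_components`, `X`, `sigma`, `l`) are illustrative and not from the original work.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix K with K[m, n] = exp(-||x_m - x_n||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.clip(sq_dists, 0.0, None) / (2.0 * sigma ** 2))

def kernel_eca_components(X, sigma, l):
    """Select the l eigenpairs of K that contribute most to V_hat(p) in Eq. (4)."""
    N = X.shape[0]
    K = gaussian_gram(X, sigma)
    lam, E = np.linalg.eigh(K)                   # eigenvalues ascending, eigenvectors as columns
    ones = np.ones(N)
    # Contribution of eigenpair i: (1 / N^2) * (sqrt(lam_i) * e_i^T 1)^2
    contrib = (np.sqrt(np.clip(lam, 0.0, None)) * (E.T @ ones)) ** 2 / N ** 2
    idx = np.argsort(contrib)[::-1][:l]          # keep the top-l entropy components
    return lam[idx], E[:, idx]
```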

3 Kernel Entropy Discriminant Analysis

Let D be the data set consisting of the data of two classes, \(D_1=\{\mathbf {x}_{11},\mathbf {x}_{12},\ldots ,\mathbf {x}_{1N_1} \}\) and \(D_2 =\{\mathbf {x}_{21},\mathbf {x}_{22},\ldots ,\mathbf {x}_{2N_2} \}\), which we assume are generated from the probability density functions (pdfs) \(p_1(\mathbf {x})\) and \(p_2(\mathbf {x})\) of the two classes, respectively. The Euclidean divergence between the pdfs of the two classes is given by

$$\begin{aligned} ED(p_1,p_2)= \int p_1^2(\mathbf x )d\mathbf x -2\int p_1(\mathbf x )p_2(\mathbf x )d\mathbf x + \int p_2^2(\mathbf x )d\mathbf x \end{aligned}$$
(5)

Using the Parzen window technique for pdf estimation, the estimate of \(ED(p_1,p_2)\), denoted by \(\hat{ED}(p_1,p_2)\), is given by

$$\begin{aligned} \hat{ED}(p_1,p_2)= \frac{1}{N_1^{2}}\sum \limits _{\mathbf x _m\in D_1}\sum \limits _{\mathbf x _n\in D_1} k_\sigma (\mathbf x _m,\mathbf x _n) - \frac{2}{N_1 N_2}\sum \limits _{\mathbf x _m\in D_1}\sum \limits _{\mathbf x _n\in D_2} k_\sigma (\mathbf x _m,\mathbf x _n) + \frac{1}{N_2^{2}}\sum \limits _{\mathbf x _m\in D_2}\sum \limits _{\mathbf x _n\in D_2} k_\sigma (\mathbf x _m,\mathbf x _n) \end{aligned}$$
(6)

Let \(\mathbf z _1=[z_{11},z_{12},\ldots ,z_{1N}]^T\) and \(\mathbf z _2=[z_{21},z_{22},\ldots ,z_{2N}]^T\) be indicator vectors with \(z_{ij}=1\) if \(\mathbf x _j \in D_i\) and \(z_{ij}=0\) otherwise. Then Eq. (6) can be written as

$$\begin{aligned} \hat{ED}(p_1,p_2)= \frac{1}{N_1^{2}}\mathbf z _1^{T}\mathbf K \mathbf z _1 - \frac{2}{N_1 N_2}\mathbf z _1^{T}\mathbf K \mathbf z _2 + \frac{1}{N_2^{2}}\mathbf z _2^{T}\mathbf K \mathbf z _2 \end{aligned}$$
(7)

where \(\mathbf K \) is the kernel gram matrix of the dataset D. Using the eigen decomposition of \(\mathbf K \), Eq. (7) can be rewritten as

$$\begin{aligned} \hat{ED}(p_1,p_2)= \sum \limits _{i=1}^{N}\bigg [\frac{1}{N_1^{2}}(\sqrt{\lambda _i}\,\mathbf e _i^{T}\mathbf z _1)^{2} - \frac{2}{N_1 N_2}(\sqrt{\lambda _i}\,\mathbf e _i^{T}\mathbf z _1)(\sqrt{\lambda _i}\,\mathbf e _i^{T}\mathbf z _2) + \frac{1}{N_2^{2}}(\sqrt{\lambda _i}\,\mathbf e _i^{T}\mathbf z _2)^{2}\bigg ] \end{aligned}$$
(8)

where \(\lambda _i\) are the eigenvalues and \(\mathbf e _i\) are the eigenvectors of K. We can write Eq. (8) as

$$\begin{aligned} \hat{ED}(p_1,p_2)= \sum \limits _{i=1}^{N}\lambda _{i}\bigg (\frac{\mathbf e _{i}^{T}\mathbf z _1}{N_1}-\frac{\mathbf e _{i}^{T}\mathbf z _2}{N_2}\bigg )^{2} \end{aligned}$$
(9)

Let \(\psi _i = \lambda _{i}\bigg ( \frac{\mathbf {e}_i^{T}\mathbf {z}_1}{N_1}-\frac{\mathbf {e}_i^{T}\mathbf {z}_2}{N_2}\bigg )^2\). Then \(\hat{ED}(p_1,p_2)= \sum \limits _{i=1}^{N} \psi _i \)

The term \(\psi _i\) measures the extent of the contribution of \(\lambda _i\) and \(\mathbf e _i\) to the Euclidean divergence. The eigenvalue and eigenvector pairs for which \(\psi _i\) is large contribute more to the Euclidean divergence than the others. By considering only those pairs that contribute significantly to the divergence, we identify the directions for projection that capture discriminatory features for the data of the two given classes.
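
A minimal sketch of this selection step, assuming NumPy, a precomputed gram matrix `K`, and class labels `y` taking values 1 and 2; all names are illustrative:

```python
import numpy as np

def kernel_eda_components(K, y, l):
    """Select the l eigenpairs of K that contribute most to the divergence estimate in Eq. (9)."""
    z1 = (y == 1).astype(float)                  # indicator vector z_1
    z2 = (y == 2).astype(float)                  # indicator vector z_2
    N1, N2 = z1.sum(), z2.sum()
    lam, E = np.linalg.eigh(K)                   # eigenvalues ascending, eigenvectors as columns
    # psi_i = lam_i * (e_i^T z_1 / N_1 - e_i^T z_2 / N_2)^2
    psi = np.clip(lam, 0.0, None) * (E.T @ z1 / N1 - E.T @ z2 / N2) ** 2
    idx = np.argsort(psi)[::-1][:l]              # eigenpairs with the largest contribution
    return lam[idx], E[:, idx], psi[idx]
```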

As the Gaussian kernel is a Mercer kernel, it is an inner product kernel. Therefore, \(k_\sigma (\mathbf x _m,\mathbf x _n) =\langle \varvec{\phi }(\mathbf x _m),\varvec{\phi }(\mathbf x _n)\rangle \), where \(\varvec{\phi }(\mathbf x )\) is the kernel feature space representation of a data point \(\mathbf x \). The mean vectors of the two classes in the \(\varvec{\phi }(\mathbf x )\)-space are given by \(\mathbf m _1^{\phi } = \frac{1}{N_1} \sum \limits _{\mathbf x _m \in D_1} \varvec{\phi }(\mathbf x _m)\) and \(\mathbf m _2^{\phi } = \frac{1}{N_2} \sum \limits _{\mathbf x _m \in D_2} \varvec{\phi }(\mathbf x _m)\). Then each term in Eq. (6) can be expressed as follows:

$$\begin{aligned} \frac{1}{N_1^{2}}\sum \limits _{\mathbf x _m\in D_1}\sum \limits _{\mathbf x _n\in D_1} k_\sigma (\mathbf x _m,\mathbf x _n) = \langle \mathbf m ^\phi _1,\mathbf m ^\phi _1\rangle \end{aligned}$$
(10)
$$\begin{aligned} \frac{1}{N_1 N_2}\sum \limits _{\mathbf x _m\in D_1}\sum \limits _{\mathbf x _n\in D_2} k_\sigma (\mathbf x _m,\mathbf x _n) = \langle \mathbf m ^\phi _1,\mathbf m ^\phi _2\rangle \end{aligned}$$
(11)
$$\begin{aligned} \frac{1}{N_2^{2}}\sum \limits _{\mathbf x _m\in D_2}\sum \limits _{\mathbf x _n\in D_2} k_\sigma (\mathbf x _m,\mathbf x _n) = \langle \mathbf m ^\phi _2,\mathbf m ^\phi _2\rangle \end{aligned}$$
(12)

Therefore, the estimate of Euclidean divergence \(\hat{ED}(p_1,p_2)\) can be expressed as

$$\begin{aligned} \hat{ED}(p_1,p_2)=\langle \mathbf m ^\phi _1,\mathbf m ^\phi _1\rangle - 2\langle \mathbf m ^\phi _1,\mathbf m ^\phi _2\rangle +\langle \mathbf m ^\phi _2,\mathbf m ^\phi _2\rangle = \vert \vert \mathbf m ^\phi _1-\mathbf m ^\phi _2\vert \vert ^2 \end{aligned}$$
(13)

Thus the Euclidean divergence in the \(\mathbf x \)-space corresponds to the squared Euclidean distance between the means of the data of the two classes in the kernel feature space. Let \(\varvec{\nu }_i\) be a direction for projection in the \(\varvec{\phi }(\mathbf x )\)-space. As in kernel PCA, the vector \(\varvec{\nu }_i\) is expressed as follows:

$$\begin{aligned} \varvec{\nu }_i = \frac{1}{\sqrt{\lambda _{i}}}\sum \limits _{n=1}^{N}e_{in}\varvec{\phi }(\mathbf x _n) \end{aligned}$$
(14)

where \(e_{in}\) is the \(n^{th}\) element of \(\mathbf e _i\). The projection of the kernel feature space representation \(\varvec{\phi }(\mathbf x )\) of a given data point onto \(\varvec{\nu }_i\) is given by

$$\begin{aligned} a_i =\varvec{\nu }_i^{T} \varvec{\phi }(\mathbf x )= \frac{1}{\sqrt{\lambda _{i}}}\sum \limits _{n=1}^{N}e_{in}\varvec{\phi }(\mathbf x _n)^{T}\varvec{\phi }(\mathbf x ) = \frac{1}{\sqrt{\lambda _{i}}}\sum \limits _{n=1}^{N}e_{in} k_\sigma (\mathbf x _n,\mathbf x ) \end{aligned}$$
(15)

For a given data point \(\mathbf x \), the l-dimensional transformed representation in the kernel EDA method is obtained by computing \(a_i\), \(i=1,2,\ldots ,l\), where l is the number of chosen directions for projection.
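
A minimal sketch of this transformation step, assuming NumPy and the same Gaussian kernel; `lam_sel` and `E_sel` are the l eigenpairs selected as above, `X_train` is the data on which \(\mathbf K \) was computed, and all names are illustrative:

```python
import numpy as np

def kernel_eda_transform(X_new, X_train, sigma, lam_sel, E_sel):
    """Compute a_i = (1 / sqrt(lam_i)) * sum_n e_in * k_sigma(x_n, x) for every row x of X_new, Eq. (15)."""
    sq_dists = (np.sum(X_new ** 2, axis=1)[:, None]
                + np.sum(X_train ** 2, axis=1)[None, :]
                - 2.0 * X_new @ X_train.T)
    K_new = np.exp(-np.clip(sq_dists, 0.0, None) / (2.0 * sigma ** 2))  # k_sigma(x_n, x)
    return (K_new @ E_sel) / np.sqrt(lam_sel)    # each row is an l-dimensional representation
```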

4 Experiments

First, we analyze the proposed algorithm on a synthetic dataset. Then we evaluate its performance on two-class IDA benchmark datasets and on multi-class datasets. We compare the performance of the proposed kernel EDA method with PCA, kernel PCA, kernel ECA, kernel FDA and GDA. The Gaussian kernel is used for all the data transformation techniques. The kernel width \(\sigma \) is chosen empirically for each dataset, and for a fair comparison the same kernel width is used for all the techniques. A linear support vector machine (SVM) is used for obtaining the classification accuracy on the transformed representation of the data. The choice of a linear classifier helps us identify how effective the transformed representation is for the classification task. The data is split into 75%, 10% and 15% for training, validation and testing, respectively.
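
This protocol can be sketched as below, assuming scikit-learn; the split proportions follow the text, while `transform_fn`, the random seed and the SVM settings are illustrative assumptions (the validation split would be used to tune \(\sigma \) and l).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def evaluate_transform(X, y, transform_fn, seed=0):
    """Split 75/10/15 (train/validation/test), transform, and score a linear SVM on the test split."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.6, random_state=seed, stratify=y_rest)
    # X_val, y_val would be used to choose the kernel width and the number of directions.
    # transform_fn fits the transformation on the training data and
    # returns the transformed train and test representations.
    A_tr, A_te = transform_fn(X_tr, y_tr, X_te)
    clf = LinearSVC().fit(A_tr, y_tr)
    return accuracy_score(y_te, clf.predict(A_te))
```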

Fig. 1. (a) Synthetic spiral dataset (b) Classification accuracies (in %) for different data transformation methods and for different values of the transformed dimension.

Table 1. Details of the 2-class IDA benchmark datasets used.

4.1 Studies on Synthetic Dataset

The synthetic dataset contains 3424 data points randomly distributed along two spirals, as shown in Fig. 1(a). The classification accuracies obtained on the transformed data using the linear SVM are plotted in Fig. 1(b). The accuracies are shown for the kernel PCA, kernel ECA and kernel EDA methods and for different values of l. It is seen that kernel EDA performs better than the other two methods.

Table 2. Classification accuracies (in %) obtained on the 2-class IDA benchmark datasets for different methods of dimensionality reduction and for different values of reduced dimension.

4.2 Studies on 2-Class Real World Datasets

Details of the two-class IDA benchmark datasets [6] used in our study are listed in Table 1. The classification accuracies on the transformed representations obtained using the different methods are given in Table 2. The results show an improvement in performance for the proposed technique over the existing techniques on almost all of the datasets.

Table 3. Details of multi-class benchmark datasets used.
Table 4. Classification accuracies (in %) obtained on the Vogel and MIT datasets for different methods of dimension reduction and for different values of the reduced dimension. Performance is also compared for the one-versus-one (1-vs-1) and the one-versus-rest (1-vs-R) approaches to multi-class classification.

4.3 Studies on Multi-class Datasets

Kernel EDA can be used for multi-class classification by converting the multi-class problem into multiple binary classification problems using the one-versus-rest or one-versus-one scheme. The Vogel [12] and MIT [8] scene multi-class datasets used in our studies are described in Table 3. Local block features are used. Each image is divided into fixed-size blocks. From each block, the color, edge direction histogram and texture features are extracted, so that each block of an image is represented by a 23-dimensional feature vector. The feature vectors from all the blocks in an image are concatenated to obtain the representation of the image.
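
A hedged sketch of the one-versus-rest scheme, reusing the `gaussian_gram`, `kernel_eda_components` and `kernel_eda_transform` helpers sketched earlier (all of them illustrative, not the original implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_rest_kernel_eda(X_train, y_train, X_test, sigma, l):
    """One-versus-rest: a separate kernel EDA transform and linear SVM per class."""
    K = gaussian_gram(X_train, sigma)
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        y_bin = np.where(y_train == c, 1, 2)                    # class c versus the rest
        lam_sel, E_sel, _ = kernel_eda_components(K, y_bin, l)  # directions for this binary problem
        A_tr = kernel_eda_transform(X_train, X_train, sigma, lam_sel, E_sel)
        A_te = kernel_eda_transform(X_test, X_train, sigma, lam_sel, E_sel)
        clf = LinearSVC().fit(A_tr, y_bin)
        # decision_function is positive for label 2 ("rest"), so negate to score membership in class c
        scores.append(-clf.decision_function(A_te))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```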

The classification accuracies obtained on the transformed representations of the Vogel and MIT datasets using the linear SVM classifier are given in Table 4. The kernel EDA technique gives a much higher accuracy than the other techniques on the MIT dataset. On the Vogel dataset, all the techniques give similar performance.

5 Conclusion

In this paper, we have proposed kernel entropy discriminant analysis (kernel EDA) as a data transformation method. It uses the Euclidean divergence between the estimates of the probability density functions of the two classes as the criterion for choosing the directions for projection. Though kernel EDA is a supervised technique, the dimension of the transformed representation is not limited by the number of classes. Studies on various datasets show that the proposed kernel EDA performs better than or on par with PCA, kernel PCA, kernel ECA, kernel FDA and GDA.