1 Introduction

The goal of data transformation techniques is to transform a large set of features into a compact set of informative features. Data transformation techniques are either unsupervised or supervised. Unsupervised techniques aim to preserve some significant characteristics of the data in the transformed space. The most commonly used unsupervised technique is principal component analysis (PCA) [4]. It projects a given d-dimensional feature vector onto the eigenvectors of the data covariance matrix corresponding to the l most significant eigenvalues of the matrix. Kernel PCA [11] performs PCA in the kernel feature space of a Mercer kernel. Unsupervised techniques do not make use of class labels, and hence the transformed representation may not be discriminative. Supervised data transformation techniques also use the class label information. The commonly used supervised techniques are Fisher discriminant analysis (FDA) [3], multiple discriminant analysis (MDA) [2] and their many variants. FDA finds a direction for projection along which the separability of the projections of the data belonging to two classes is maximum. MDA is a multi-class extension of FDA. Kernel FDA [7] performs FDA in the kernel feature space. Generalized discriminant analysis (GDA) [1] is an extension of kernel FDA to multiple classes. A major limitation of these supervised techniques is that the dimension of the transformed data is limited by the number of classes.

In the kernel entropy component analysis (kernel ECA) [5, 6] technique, the eigenvectors of the kernel gram matrix used for projection are chosen based on their contribution to the Renyi quadratic entropy of the input data. Kernel ECA is an unsupervised technique.

In this paper, we develop a new discriminative transformation method that can be considered an extension of kernel ECA to supervised dimension reduction. It chooses the directions for projection that maximally preserve the Euclidean divergence [10] between the probability density functions of the two classes. An estimator of the Euclidean divergence is expressed in terms of the eigenvectors and eigenvalues of the kernel gram matrix of the Gaussian kernel used in the Parzen window method [9] for density estimation. The directions for projection are obtained from the eigenvalues and eigenvectors that contribute significantly to the divergence estimate.

The paper is organized as follows. In Sect. 2, we present the kernel ECA method. We discuss the proposed kernel EDA method in Sect. 3. Experimental studies and results are presented in Sect. 4, and Sect. 5 concludes the paper.

2 Kernel Entropy Component Analysis

Kernel entropy component analysis (kernel ECA) focuses on entropy components, instead of the principal components that represent variance in kernel PCA. The Renyi quadratic entropy of a distribution \(p(\mathbf x )\) is given by

$$\begin{aligned} H(p(\mathbf x ))=-\log \int p^2(\mathbf x )\,d\mathbf x \end{aligned}$$
(1)

The information potential of a distribution \(p(\mathbf x )\) is defined as \(V(p)= \int p^2(\mathbf x )\,d\mathbf x \). The information potential can also be expressed as \(V(p)=\mathcal {E}_p(p)\), where \(\mathcal {E}_p(\cdot )\) denotes expectation with respect to \(p(\mathbf x )\). Consider the data set \(D=\{\mathbf {x}_1,\mathbf {x}_2,\ldots ,\mathbf {x}_N \}\). Let \(k_\sigma (\mathbf x _m,\cdot )\) be the Gaussian kernel with width \(\sigma \) used in the Parzen window method for estimating the density at \(\mathbf x _m\). It may be noted that the Gaussian kernel is a Mercer kernel and, therefore, a positive semi-definite kernel. Then the estimate of the density is given by

$$\begin{aligned} \hat{p}(\mathbf x _m) = \frac{1}{N} \sum \limits _\mathbf{x _n \in D} k_{\sigma } (\mathbf x _m,\mathbf x _n) \end{aligned}$$
(2)

Then the estimate of \(V(p(\mathbf x ))\) denoted by \(\hat{V}(p)\) is given by

$$\begin{aligned} \hat{V}(p)= \frac{1}{N}\sum \limits _{\mathbf x _m\in D}\hat{p}(\mathbf x _m) = \frac{1}{N}\sum \limits _{\mathbf x _m\in D}\frac{1}{N}\sum \limits _{\mathbf x _n\in D} k_\sigma (\mathbf x _m,\mathbf x _n)=\frac{1}{N^{2}} \bar{\mathbf{1}}^{T}\mathbf K \bar{\mathbf{1}} \end{aligned}$$
(3)

where \(\bar{\mathbf{1}}\) is an \(N\times 1\) vector of all 1's and \(\mathbf K \) is the kernel gram matrix of the kernel \(k_\sigma (\cdot ,\cdot )\) on the dataset D. Using the eigen decomposition of the kernel gram matrix \(\mathbf K \), Eq. (3) can be rewritten as

$$\begin{aligned} \hat{V}(p)= \frac{1}{N^2}\sum \limits _{i=1}^{N}(\sqrt{\lambda _{i}}\,\mathbf e _{i}^{T}\bar{\mathbf{1}})^{2} \end{aligned}$$
(4)

where \(\lambda _i\) and \(\mathbf {e}_i\) are the eigenvalues and eigenvectors of \(\mathbf K \), respectively. The eigenvectors of \(\mathbf K \) used for projection are identified based on the extent of their contribution to the information potential estimate. Note that the kernel ECA method does not make use of the class labels of the examples in D.
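
For concreteness, this selection step can be sketched as follows. This is a minimal NumPy sketch, assuming a Gaussian kernel computed directly from the data; the names (`gaussian_gram`, `kernel_eca_components`, `X`, `sigma`, `l`) are illustrative and not from the original work.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix K with K[m, n] = exp(-||x_m - x_n||^2 / (2 * sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.clip(sq_dists, 0.0, None) / (2.0 * sigma ** 2))

def kernel_eca_components(X, sigma, l):
    """Select the l eigenpairs of K that contribute most to V_hat(p) in Eq. (4)."""
    N = X.shape[0]
    K = gaussian_gram(X, sigma)
    lam, E = np.linalg.eigh(K)                   # eigenvalues ascending, eigenvectors as columns
    ones = np.ones(N)
    # Contribution of eigenpair i: (1 / N^2) * (sqrt(lam_i) * e_i^T 1)^2
    contrib = (np.sqrt(np.clip(lam, 0.0, None)) * (E.T @ ones)) ** 2 / N ** 2
    idx = np.argsort(contrib)[::-1][:l]          # keep the top-l entropy components
    return lam[idx], E[:, idx]
```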

3 Kernel Entropy Discriminant Analysis

Let D be the data set consisting of the data of two classes, \(D_1=\{\mathbf {x}_{11},\mathbf {x}_{12},\ldots ,\mathbf {x}_{1N_1} \}\) and \(D_2 =\{\mathbf {x}_{21},\mathbf {x}_{22},\ldots ,\mathbf {x}_{2N_2} \}\), which we assume are generated from the probability density functions (pdfs) \(p_1(\mathbf {x})\) and \(p_2(\mathbf {x})\) of the two classes, respectively. The Euclidean divergence between the pdfs of the two classes is given by

$$\begin{aligned} ED(p_1,p_2)= \int p_1^2(\mathbf x )d\mathbf x -2\int p_1(\mathbf x )p_2(\mathbf x )d\mathbf x + \int p_2^2(\mathbf x )d\mathbf x \end{aligned}$$
(5)

Using the Parzen window technique for pdf estimation, the estimate of \(ED(p_1,p_2)\), denoted by \(\hat{ED}(p_1,p_2)\), is given by

$$\begin{aligned} \hat{ED}(p_1,p_2)= \frac{1}{N_1^{2}}\sum \limits _{\mathbf x _m\in D_1}\sum \limits _{\mathbf x _n\in D_1} k_\sigma (\mathbf x _m,\mathbf x _n) - \frac{2}{N_1 N_2}\sum \limits _{\mathbf x _m\in D_1}\sum \limits _{\mathbf x _n\in D_2} k_\sigma (\mathbf x _m,\mathbf x _n) + \frac{1}{N_2^{2}}\sum \limits _{\mathbf x _m\in D_2}\sum \limits _{\mathbf x _n\in D_2} k_\sigma (\mathbf x _m,\mathbf x _n) \end{aligned}$$
(6)

Let \(\mathbf z _1=[z_{11},z_{12},\ldots ,z_{1N}]^T\) and \(\mathbf z _2=[z_{21},z_{22},\ldots ,z_{2N}]^T\) be indicator vectors with \(z_{ij}=1\) if \(\mathbf x _j \in D_i\) and \(z_{ij}=0\) otherwise. Then Eq. (6) can be written as

$$\begin{aligned} \hat{ED}(p_1,p_2)= \frac{1}{N_1^{2}}\mathbf z _1^{T}\mathbf K \mathbf z _1 - \frac{2}{N_1 N_2}\mathbf z _1^{T}\mathbf K \mathbf z _2 + \frac{1}{N_2^{2}}\mathbf z _2^{T}\mathbf K \mathbf z _2 \end{aligned}$$
(7)

where \(\mathbf K \) is the kernel gram matrix of the dataset D. Using the eigen decomposition of \(\mathbf K \), Eq. (7) can be rewritten as

$$\begin{aligned} \hat{ED}(p_1,p_2)= \sum \limits _{i=1}^{N}\bigg [\frac{1}{N_1^{2}}(\sqrt{\lambda _i}\,\mathbf e _i^{T}\mathbf z _1)^{2} - \frac{2}{N_1 N_2}(\sqrt{\lambda _i}\,\mathbf e _i^{T}\mathbf z _1)(\sqrt{\lambda _i}\,\mathbf e _i^{T}\mathbf z _2) + \frac{1}{N_2^{2}}(\sqrt{\lambda _i}\,\mathbf e _i^{T}\mathbf z _2)^{2}\bigg ] \end{aligned}$$
(8)

where \(\lambda _i\) are the eigenvalues and \(\mathbf e _i\) are the eigenvectors of K. We can write Eq. (8) as

$$\begin{aligned} \hat{ED}(p_1,p_2)= \sum \limits _{i=1}^{N}\lambda _{i}\bigg (\frac{\mathbf e _{i}^{T}\mathbf z _1}{N_1}-\frac{\mathbf e _{i}^{T}\mathbf z _2}{N_2}\bigg )^{2} \end{aligned}$$
(9)

Let \(\psi _i = \lambda _{i}\bigg ( \frac{\mathbf {e}_i^{T}\mathbf {z}_1}{N_1}-\frac{\mathbf {e}_i^{T}\mathbf {z}_2}{N_2}\bigg )^2\). Then \(\hat{ED}(p_1,p_2)= \sum \limits _{i=1}^{N} \psi _i \)

The term \(\psi _i\) measures the extent of the contribution of \(\lambda _i\) and \(\mathbf e _i\) to the Euclidean divergence. The eigenvalue and eigenvector pairs for which \(\psi _i\) is large contribute more to the Euclidean divergence than the others. By considering only those pairs that contribute significantly to the divergence, we identify the directions for projection that capture discriminatory features for the data of the two given classes.
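
A minimal sketch of this selection step, assuming NumPy, a precomputed gram matrix `K`, and class labels `y` taking values 1 and 2; all names are illustrative:

```python
import numpy as np

def kernel_eda_components(K, y, l):
    """Select the l eigenpairs of K that contribute most to the divergence estimate in Eq. (9)."""
    z1 = (y == 1).astype(float)                  # indicator vector z_1
    z2 = (y == 2).astype(float)                  # indicator vector z_2
    N1, N2 = z1.sum(), z2.sum()
    lam, E = np.linalg.eigh(K)                   # eigenvalues ascending, eigenvectors as columns
    # psi_i = lam_i * (e_i^T z_1 / N_1 - e_i^T z_2 / N_2)^2
    psi = np.clip(lam, 0.0, None) * (E.T @ z1 / N1 - E.T @ z2 / N2) ** 2
    idx = np.argsort(psi)[::-1][:l]              # eigenpairs with the largest contribution
    return lam[idx], E[:, idx], psi[idx]
```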

As the Gaussian kernel is a Mercer kernel, it is an inner product kernel. Therefore, \(k_\sigma (\mathbf x _m,\mathbf x _n) =\langle \varvec{\phi }(\mathbf x _m),\varvec{\phi }(\mathbf x _n)\rangle \), where \(\varvec{\phi }(\mathbf x )\) is the kernel feature space representation of a data point \(\mathbf x \). The mean vectors of the two classes in the \(\varvec{\phi }(\mathbf x )\)-space are given by \(\mathbf m _1^{\phi } = \frac{1}{N_1} \sum \limits _{\mathbf x _m \in D_1} \varvec{\phi }(\mathbf x _m)\) and \(\mathbf m _2^{\phi } = \frac{1}{N_2} \sum \limits _{\mathbf x _m \in D_2} \varvec{\phi }(\mathbf x _m)\). Then each term in Eq. (6) can be expressed as follows:

$$\begin{aligned} \frac{1}{N_1^{2}}\sum \limits _{\mathbf x _m\in D_1}\sum \limits _{\mathbf x _n\in D_1} k_\sigma (\mathbf x _m,\mathbf x _n) = \langle \mathbf m ^\phi _1,\mathbf m ^\phi _1\rangle \end{aligned}$$
(10)
$$\begin{aligned} \frac{1}{N_1 N_2}\sum \limits _{\mathbf x _m\in D_1}\sum \limits _{\mathbf x _n\in D_2} k_\sigma (\mathbf x _m,\mathbf x _n) = \langle \mathbf m ^\phi _1,\mathbf m ^\phi _2\rangle \end{aligned}$$
(11)
$$\begin{aligned} \frac{1}{N_2^{2}}\sum \limits _{\mathbf x _m\in D_2}\sum \limits _{\mathbf x _n\in D_2} k_\sigma (\mathbf x _m,\mathbf x _n) = \langle \mathbf m ^\phi _2,\mathbf m ^\phi _2\rangle \end{aligned}$$
(12)

Therefore, the estimate of Euclidean divergence \(\hat{ED}(p_1,p_2)\) can be expressed as

$$\begin{aligned} \hat{ED}(p_1,p_2)=\langle \mathbf m ^\phi _1,\mathbf m ^\phi _1\rangle - 2\langle \mathbf m ^\phi _1,\mathbf m ^\phi _2\rangle +\langle \mathbf m ^\phi _2,\mathbf m ^\phi _2\rangle = \vert \vert \mathbf m ^\phi _1-\mathbf m ^\phi _2\vert \vert ^2 \end{aligned}$$
(13)

Thus the Euclidean divergence in the \(\mathbf x \)-space corresponds to the squared Euclidean distance between the means of the data of the two classes in the kernel feature space. Let \(\varvec{\nu }_i\) be a direction for projection in the \(\varvec{\phi }(\mathbf x )\)-space. As in kernel PCA, the vector \(\varvec{\nu }_i\) is expressed as follows:

$$\begin{aligned} \varvec{\nu }_i = \frac{1}{\sqrt{\lambda _{i}}}\sum \limits _{n=1}^{N}e_{in}\varvec{\phi }(\mathbf x _n) \end{aligned}$$
(14)

where \(e_{in}\) is the \(n^{th}\) element of \(\mathbf e _i\). The projection of the kernel feature space representation \(\varvec{\phi }(\mathbf x )\) of a given data point onto \(\varvec{\nu }_i\) is given by

$$\begin{aligned} a_i =\varvec{\nu }_i^{T} \varvec{\phi }(\mathbf x )= \frac{1}{\sqrt{\lambda _{i}}}\sum \limits _{n=1}^{N}e_{in}\varvec{\phi }(\mathbf x _n)^{T}\varvec{\phi }(\mathbf x ) = \frac{1}{\sqrt{\lambda _{i}}}\sum \limits _{n=1}^{N}e_{in} k_\sigma (\mathbf x _n,\mathbf x ) \end{aligned}$$
(15)

For a given data point \(\mathbf x \), the l-dimensional transformed representation in the kernel EDA method is obtained by computing \(a_i\), \(i=1,2,\ldots ,l\), where l is the number of chosen directions for projection.
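
A minimal sketch of this transformation step, assuming NumPy and the same Gaussian kernel; `lam_sel` and `E_sel` are the l eigenpairs selected as above, `X_train` is the data on which \(\mathbf K \) was computed, and all names are illustrative:

```python
import numpy as np

def kernel_eda_transform(X_new, X_train, sigma, lam_sel, E_sel):
    """Compute a_i = (1 / sqrt(lam_i)) * sum_n e_in * k_sigma(x_n, x) for every row x of X_new, Eq. (15)."""
    sq_dists = (np.sum(X_new ** 2, axis=1)[:, None]
                + np.sum(X_train ** 2, axis=1)[None, :]
                - 2.0 * X_new @ X_train.T)
    K_new = np.exp(-np.clip(sq_dists, 0.0, None) / (2.0 * sigma ** 2))  # k_sigma(x_n, x)
    return (K_new @ E_sel) / np.sqrt(lam_sel)    # each row is an l-dimensional representation
```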

4 Experiments

First, we analyze the proposed algorithm on a synthetic dataset. Then we evaluate its performance on two-class IDA benchmark datasets and on multi-class datasets. We compare the performance of the proposed kernel EDA method with PCA, kernel PCA, kernel ECA, kernel FDA and GDA. The Gaussian kernel is used for all the data transformation techniques. The kernel width \(\sigma \) is chosen empirically for each dataset, and for a fair comparison the same kernel width is used for all the techniques. A linear support vector machine (SVM) is used for obtaining the classification accuracy on the transformed representation of the data. The choice of a linear classifier helps us identify how effective the transformed representation is for the classification task. The data is split into 75%, 10% and 15% for training, validation and testing, respectively.
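
This protocol can be sketched as below, assuming scikit-learn; the split proportions follow the text, while `transform_fn`, the random seed and the SVM settings are illustrative assumptions (the validation split would be used to tune \(\sigma \) and l).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def evaluate_transform(X, y, transform_fn, seed=0):
    """Split 75/10/15 (train/validation/test), transform, and score a linear SVM on the test split."""
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, test_size=0.6, random_state=seed, stratify=y_rest)
    # X_val, y_val would be used to choose the kernel width and the number of directions.
    # transform_fn fits the transformation on the training data and
    # returns the transformed train and test representations.
    A_tr, A_te = transform_fn(X_tr, y_tr, X_te)
    clf = LinearSVC().fit(A_tr, y_tr)
    return accuracy_score(y_te, clf.predict(A_te))
```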

Fig. 1. (a) Synthetic spiral dataset (b) Classification accuracies (in %) for different data transformation methods and for different values of the transformed dimension.

Table 1. Details of the 2-class IDA benchmark datasets used.

4.1 Studies on Synthetic Dataset

The synthetic dataset contains 3424 data points randomly distributed along two spirals, as shown in Fig. 1(a). The classification accuracies obtained on the transformed data using the linear SVM are plotted in Fig. 1(b). The accuracies are shown for the kernel PCA, kernel ECA and kernel EDA methods and for different values of l. It is seen that kernel EDA performs better than the other two methods.

Table 2. Classification accuracies (in %) obtained on the 2-class IDA benchmark datasets for different methods of dimensionality reduction and for different values of reduced dimension.

4.2 Studies on 2-Class Real World Datasets

Details of the two-class IDA benchmark datasets [6] used in our study are listed in Table 1. The classification accuracies on the transformed representations obtained using the different methods are given in Table 2. The results show an improvement in performance for the proposed technique over the existing techniques on almost all of the datasets.

Table 3. Details of multi-class benchmark datasets used.
Table 4. Classification accuracies (in %) obtained on the Vogel and MIT datasets for different methods of dimension reduction and for different values of the reduced dimension. Performance is also compared for the one-versus-one (1-vs-1) and the one-versus-rest (1-vs-R) approaches to multi-class classification.

4.3 Studies on Multi-class Datasets

Kernel EDA can be used for multi-class classification by converting the multi-class problem into multiple binary classification problems using the one-versus-rest or one-versus-one scheme. The Vogel [12] and MIT [8] scene multi-class datasets used in our studies are described in Table 3. Local block features are used. Each image is divided into fixed-size blocks. From each block, the color, edge direction histogram and texture features are extracted, so that each block of an image is represented by a 23-dimensional feature vector. The feature vectors from all the blocks in an image are concatenated to obtain the representation of the image.
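
A hedged sketch of the one-versus-rest scheme, reusing the `gaussian_gram`, `kernel_eda_components` and `kernel_eda_transform` helpers sketched earlier (all of them illustrative, not the original implementation):

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_rest_kernel_eda(X_train, y_train, X_test, sigma, l):
    """One-versus-rest: a separate kernel EDA transform and linear SVM per class."""
    K = gaussian_gram(X_train, sigma)
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        y_bin = np.where(y_train == c, 1, 2)                    # class c versus the rest
        lam_sel, E_sel, _ = kernel_eda_components(K, y_bin, l)  # directions for this binary problem
        A_tr = kernel_eda_transform(X_train, X_train, sigma, lam_sel, E_sel)
        A_te = kernel_eda_transform(X_test, X_train, sigma, lam_sel, E_sel)
        clf = LinearSVC().fit(A_tr, y_bin)
        # decision_function is positive for label 2 ("rest"), so negate to score membership in class c
        scores.append(-clf.decision_function(A_te))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```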

The classification accuracies obtained on the transformed representations of the Vogel and MIT datasets using the linear SVM classifier are given in Table 4. The kernel EDA technique gives a much higher accuracy than the other techniques on the MIT dataset. On the Vogel dataset, all the techniques give similar performance.

5 Conclusion

In this paper, we have proposed kernel entropy discriminant analysis (kernel EDA) as a data transformation method. It uses the Euclidean divergence between the estimates of the probability density functions of the two classes as the criterion for choosing the directions for projection. Though kernel EDA is a supervised technique, the dimension of the transformed representation is not limited by the number of classes. Studies on various datasets show that the proposed kernel EDA performs better than or on par with PCA, kernel PCA, kernel ECA, kernel FDA and GDA.