
1 Introduction

In recent years, the surge of visual information has led to the widespread application of image recognition in the field of pattern recognition, where a test image is to be identified based on the samples in the training data. Many recognition algorithms are appearance based, treating an image as a large array of intensity values. In these methods, an image of size \(m\times n\) is represented as an mn-dimensional vector, which demands heavy computation and leads to the well-known curse of dimensionality. Although images are represented as high dimensional data, natural images have only limited degrees of freedom and thus often lie on a comparatively low dimensional linear or non-linear manifold. This observation led to the development of Dimensionality Reduction (DR) techniques, which reduce the redundant information present in the higher dimensional space and mitigate the curse of dimensionality.

The elementary task of DR techniques is to find a linear or non-linear mapping of the data from a high dimensional space to a lower dimensional subspace that preserves some specific information of the high dimensional space, leading to favorable recognition outcomes while reducing the computational load. Two of the most widely used linear DR techniques are Principal Component Analysis (PCA) [15] and Linear Discriminant Analysis (LDA) [8]. These methods seek a low dimensional representation of the data assuming it lies on a linear manifold, but this assumption usually does not hold for complex data such as images. To overcome this shortcoming, nonlinear DR methods were developed to discover the underlying low dimensional manifold. Well-known nonlinear DR methods include Locally Linear Embedding (LLE) [12] and Laplacian Eigenmaps (LE) [1]. These methods cannot embed out-of-sample data because they lack an explicit mapping from the high dimensional space to the low dimensional subspace. To overcome this, a linear extension of LLE was proposed as Orthogonal Neighborhood Preserving Projection (ONPP) [4].

The neighborhood selection and the distance used to define it have a paramount effect on the learned manifold. Most DR methods use the Euclidean distance to define the neighborhood of a data point, which generally does not match the classification properties: data points from different classes may be considered neighbors because of their small Euclidean distance, while data points from the same class may not be considered neighbors because of a large Euclidean distance. One approach to handle this problem is to incorporate class information in neighborhood selection, so that data points from the same class are considered neighbors of each other. In image recognition, where class labels are available, neighbors are then selected based on the class label only. Such methods are known as supervised methods, but when data points from different classes overlap or are closely placed, this hard neighborhood selection rule does not capture the data geometry efficiently.

Various approaches to defining this neighborhood have been attempted in the past. It is shown in [7] that combining different distance matrices or dissimilarity representations can often outperform each individual one. In [11], the authors proposed Supervised LLE (SLLE), which uses class label knowledge to modify the pair-wise distances that define the neighborhood; it blindly adds a constant to the Euclidean distance between data points belonging to different classes. In [10], a k-means clustering based approach is proposed to find the neighbors of a data point; this is an unsupervised approach that does not use class label knowledge to define the neighborhood. The work documented in [16] proposes an adaptive neighborhood of varying size based on local linearity.

In this article we address the problem of selecting neighbors based on either the class label or a distance measure alone. The main contribution of this paper is the computation of a class similarity based distance between two data points and a new neighborhood selection rule based on this class similarity. The proposed rule merges the effect of class knowledge and the Euclidean distance, and is used in conventional ONPP for image recognition. We report the results of recognition experiments on two kinds of data: face image data with very high dimensionality, and handwritten numerals data with a large number of samples. The proposed Class Similarity based ONPP outperforms ONPP with a significantly smaller number of subspace dimensions. The paper also compares the recognition performance of a modified variant of ONPP, namely MONPP [5], with the proposed neighborhood rule.

The article is organized as follows: Sect. 2 explains ONPP in detail, Sect. 3 proposes a modified distance based on class similarity, Sect. 4 documents recognition experiments with the conventional ONPP approach and the Class Similarity based ONPP approach, and Sect. 5 concludes the work.

2 Orthogonal Neighborhood Preserving Projections

ONPP obtains the lower dimensional subspace in three steps. First, a local neighborhood of each data point is defined. Second, each data point is expressed as a linear combination of its neighbors. The third step seeks subspace bases that preserve this linear relationship among neighbors in the lower dimensional representation of the data through a minimization problem. In applications involving images, let \(\mathbf {X = [x_{1},x_{2},...x_{N}]}\) be the data matrix of vectorized images, so that an image \(\mathbf {x_i}\) constitutes a point in mn-dimensional space. The goal of subspace based DR methods is to find orthogonal/non-orthogonal subspace bases \(\mathbf {V}\in \mathcal {R}^{mn\times d}\) such that the low dimensional representation is \(\mathbf {Y}=\mathbf {V}^{T}\mathbf {X}\).

Step 1: Finding Neighborhood: For each data point \(\mathbf {x_i}\), a local neighborhood is defined using a simple scheme such as k-Nearest Neighbors (k-NN) or \(\varepsilon \)-Neighbors (\(\varepsilon \)-NN). Let \(\mathcal {N}_{x_{i}}\) be the set of \(k \) neighbors of \(\mathbf {x_i}\).
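A minimal sketch of this step in NumPy (the function and variable names are illustrative, not taken from the paper): all pairwise squared Euclidean distances between the columns of the data matrix are computed, and the k closest points of each sample are kept, excluding the sample itself.

```python
# Minimal k-NN neighborhood search for Step 1; X holds one vectorized image
# per column, matching the notation X = [x_1, ..., x_N] used above.
import numpy as np

def knn_neighbors(X, k):
    """X: (D, N) data matrix. Returns an (N, k) array of neighbor indices."""
    sq_norms = np.sum(X ** 2, axis=0)
    # Pairwise squared Euclidean distances between columns of X
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(d2, np.inf)          # exclude the point itself
    return np.argsort(d2, axis=1)[:, :k]  # indices of the k closest points
```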

Step 2: Calculating Reconstruction Weights: ONPP assumes that the neighborhood lies on a locally linear manifold, so each data point can be expressed as a linear combination of its neighbors. For \(\mathbf {x_i}\), this linear combination can be written as \(\sum _{j=1}^{k} w_{ij}\mathbf {x_{j}}\), where \(\mathbf {x_{j}}\in \mathcal {N}_{x_{i}}\). The linear weights \(w_{ij}\) for each \(\mathbf {x_i}\) are computed by posing the problem as the minimization of the reconstruction error

$$\begin{aligned} \underset{\mathbf {W}}{\arg \min }\sum \nolimits _{i=1}^{N}\parallel \mathbf {x_{i}}-\sum \nolimits _{j=1}^{k}w_{ij}\mathbf {x_{j}}\parallel ^{2} \text { s.t. }\sum \nolimits _{j=1}^{k} w_{ij}=1 \end{aligned}$$
(1)
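As a concrete illustration, the following is a minimal sketch of the standard closed-form solution of Eq. (1), as used in LLE and ONPP: for each point the local Gram matrix of centered neighbors is formed, a linear system against a vector of ones is solved, and the weights are rescaled to sum to one. The regularization constant reg is an assumption needed when the Gram matrix is singular, not a value from the paper; column i of the returned matrix holds the weights of \(\mathbf {x_i}\), which matches the \(\mathbf {X}(\mathbf {I}-\mathbf {W})(\mathbf {I}-\mathbf {W})^T\mathbf {X}^T\) formulation used in Step 3.

```python
# Closed-form reconstruction weights for Eq. (1), one local least-squares
# problem per data point, with the sum-to-one constraint enforced by rescaling.
import numpy as np

def reconstruction_weights(X, neighbors, reg=1e-3):
    """X: (D, N) data matrix; neighbors: (N, k) neighbor indices.
    Returns an (N, N) weight matrix W with column i holding the weights of x_i."""
    N = X.shape[1]
    k = neighbors.shape[1]
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[:, neighbors[i]] - X[:, [i]]   # neighbors centered on x_i
        C = Z.T @ Z                          # local (k x k) Gram matrix
        C += np.eye(k) * reg * np.trace(C)   # regularize in case C is singular
        w = np.linalg.solve(C, np.ones(k))
        W[neighbors[i], i] = w / w.sum()     # enforce sum(w_ij) = 1
    return W
```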

An improved variant of ONPP was proposed as Modified Orthogonal Neighborhood Preserving Projections (MONPP) in [5]. MONPP stresses the fact that for larger neighborhoods the local linearity assumption may not hold, and the neighborhood assumed to be a linear patch may have some inherent non-linearity. To take this non-linearity into account, MONPP uses nonlinear weights in place of the linear weights to improve recognition performance. In this article, both ONPP and MONPP are used to compare the performance of the proposed Class Similarity based neighborhood selection rule.

Step 3: Finding Subspace: This step performs the dimensionality reduction by finding the bases \(\mathbf {V}\in \mathcal {R}^{mn\times d}\) of a low dimensional subspace that preserves the linear relationship of each \(\mathbf {x_i}\) with its neighbors \(\mathbf {x_j}\) (with reconstruction weights \(w_{ij}\)) in each projection \(\mathbf {y_i}\) and its neighbors \(\mathbf {y_j}\) with the same weights \(w_{ij}\). Such an embedding is obtained by minimizing the reconstruction error in the subspace. Hence, the minimization problem is defined as

$$\begin{aligned} \underset{\mathbf {Y}}{\arg \min }\sum \nolimits _{i=1}^{N}\parallel \mathbf {y_{i}}-\sum \nolimits _{j=1}^{k}w_{ij}\mathbf {y_{j}}\parallel ^{2} \text {s.t. }\mathbf {V}^{T}\mathbf {V}=\mathbf {I} \end{aligned}$$
(2)

The bases \(\mathbf {V}\) turn out to be the eigenvectors of \(\mathbf {X}(\mathbf {I}-\mathbf {W})(\mathbf {I}-\mathbf {W})^T\mathbf {X}^T\) corresponding to the smallest d eigenvalues (\(d \ll mn\)). For recognition tasks, an out-of-sample data point \(\mathbf {x_l}\) can now be projected to the subspace as \(\mathbf {y_{l}}=\mathbf {V}^{T}\mathbf {x_{l}}\).
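A sketch of this eigendecomposition step, consistent with the weight matrix convention in the sketch above; scipy.linalg.eigh returns eigenvalues in ascending order, so the first d eigenvectors are the ones to keep.

```python
# Step 3: ONPP bases as the eigenvectors of X (I - W)(I - W)^T X^T
# associated with the d smallest eigenvalues.
import numpy as np
from scipy.linalg import eigh

def onpp_bases(X, W, d):
    """X: (D, N) data matrix; W: (N, N) reconstruction weights; d: target dimension."""
    N = X.shape[1]
    M = (np.eye(N) - W) @ (np.eye(N) - W).T
    S = X @ M @ X.T                 # (D, D) symmetric matrix
    vals, vecs = eigh(S)            # eigenvalues in ascending order
    V = vecs[:, :d]                 # orthonormal bases of the ONPP subspace
    return V                        # project data with Y = V.T @ X
```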

DR methods can be implemented in supervised mode when class label knowledge is available. It is shown in [4] that incorporating class information in neighborhood selection improves recognition, but it is not always a good idea to ignore the Euclidean distance entirely. Moreover, in tasks such as image recognition, the number of samples N available to learn a projection space is smaller than the dimension mn (known as the small sample size problem). To overcome this limitation, dimensionality reduction algorithms typically apply PCA as a preprocessing step to obtain an intermediate low dimensional space in which the bases are learned. Based on these two observations, the next section proposes a new class similarity based distance that uses the preprocessed data to define a new neighborhood selection rule.
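A minimal sketch of this preprocessing step, assuming scikit-learn is available; the intermediate dimension d_int is a free choice (the paper does not fix a particular value for this stage).

```python
# PCA preprocessing to an intermediate space; used both to avoid the small
# sample size problem when learning the ONPP bases and, in Sect. 3, to keep
# the logistic regression tractable.
from sklearn.decomposition import PCA

def pca_preprocess(X, d_int):
    """X: (D, N) vectorized images. Returns the (d_int, N) representation
    and the fitted PCA model for projecting test samples later."""
    pca = PCA(n_components=d_int)
    Z = pca.fit_transform(X.T).T    # scikit-learn expects samples in rows
    return Z, pca
```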

3 Class Similarity Based ONPP (CS-ONPP)

In conventional ONPP, the neighbors of a data point \(\mathbf {x_i}\) are selected based on the Euclidean distance (in the unsupervised setting) or on class label information (in the supervised setting). Several works have attempted to incorporate the underlying similarity between data points along with their class labels. In [17], the authors proposed Enhanced Supervised LLE (ESLLE), where the Euclidean distance is modified by adding a constant increment for pairs of data points belonging to different classes, keeping the distance between intra-class pairs unchanged. This scheme does not consider the similarity between intra-class or inter-class data in any way. When data points are very similar, they are closely placed in the high dimensional space and the classes overlap. In such cases, a hard decision rule based on the class label may not help in finding a good low dimensional representation. To overcome this limitation of the neighborhood finding rule, we propose a novel neighborhood rule inspired by [18].

Instead of assigning a data point \(\mathbf {x_i}\) to a unique class and modifying the distance accordingly, let us define a C-dimensional class probability vector \(\mathbf {p}(\mathbf {x_i}) = [p_1(\mathbf {x_i}), p_2(\mathbf {x_i}), ..., p_C(\mathbf {x_i})]^T\), where C is the number of classes in the data. The \(c^{th}\) element \(p_c({\mathbf {x_i}})\) represents the probability of the data point \(\mathbf {x_i}\) belonging to the \(c^{th}\) class.

For a given data matrix \(\mathbf {X}\) with known class labels, the probability of each data point \(\mathbf {x_i}\) belonging to class c can be computed using Logistic Regression (LR). LR assumes that the logit of the probability \(\pi (\mathbf {x_i})\) is a linear combination of the features of \(\mathbf {x_i}\), given by

$$\begin{aligned} \nonumber \log \Big (\frac{\pi (\mathbf {x_i})}{1-\pi (\mathbf {x_i})}\Big ) = \alpha + \beta ^T\mathbf {x_i} \end{aligned}$$

Specifically for a class c,

$$\begin{aligned} \pi (\mathbf {x_i}; \alpha _c , \beta _c) = \frac{\exp (\alpha _c+{\beta _c}^T\mathbf {x_i})}{1+\exp (\alpha _c+{\beta _c}^T\mathbf {x_i})}, c = 1, ..., C. \end{aligned}$$
(3)

where \(\alpha _c \in \mathcal {R}\) and \(\beta _c \in \mathcal {R}^{mn}\) are the parameters for class c, learned on the training data with class knowledge using maximum likelihood estimation.

Performing LR on high dimensional data incurs a huge computational burden, so we take advantage of the preprocessing performed in ONPP and use the lower dimensional PCA representation to find these class probabilities for each data point \(\mathbf {x_i}\). Let \(\mathbf {z_i}\) be the lower dimensional PCA representation of \(\mathbf {x_i}\). Eq. (3) then becomes

$$\begin{aligned} \pi (\mathbf {x_i}) = \pi (\mathbf {z_i}; \alpha _c , \beta _c) = \frac{\exp (\alpha _c+{\beta _c}^T\mathbf {z_i})}{1+\exp (\alpha _c+{\beta _c}^T\mathbf {z_i})} \end{aligned}$$
(4)

To form the probability vector \(\mathbf {p}(\mathbf {x_i})\), each entry \(p_c(\mathbf {x_i})\) for class c can be computed by

$$\begin{aligned} p_c(\mathbf {x_i}) = \frac{\pi (\mathbf {z_i}; \alpha _c , \beta _c)}{\sum _{c=1}^C\pi (\mathbf {z_i}; \alpha _c , \beta _c)} \end{aligned}$$
(5)
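A sketch of Eqs. (3)-(5), assuming scikit-learn: one binary logistic regression is fitted per class on the PCA representation \(\mathbf {z_i}\), and the per-class outputs of Eq. (4) are normalized as in Eq. (5) so that each probability vector sums to one. Note that scikit-learn's default solver applies L2 regularization, whereas the paper only specifies maximum likelihood estimation; function and parameter names are illustrative.

```python
# Class probability vectors p(x_i) of Eqs. (3)-(5), computed on a d_pca
# dimensional PCA representation of the data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def class_probabilities(X, labels, d_pca):
    """X: (D, N) data matrix; labels: (N,) class labels.
    Returns a (C, N) matrix whose column i is the probability vector p(x_i)."""
    Z = PCA(n_components=d_pca).fit_transform(X.T)   # (N, d_pca), samples in rows
    classes = np.unique(labels)
    pi = np.zeros((len(classes), X.shape[1]))
    for idx, c in enumerate(classes):
        lr = LogisticRegression(max_iter=1000)       # one-vs-rest model for class c
        lr.fit(Z, (labels == c).astype(int))
        pi[idx] = lr.predict_proba(Z)[:, 1]          # pi(z_i; alpha_c, beta_c), Eq. (4)
    return pi / pi.sum(axis=0, keepdims=True)        # normalization of Eq. (5)
```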

Note that the PCA representation \([\mathbf {z_1, z_2, ... z_N}]\) carries class information from the corresponding data points \([\mathbf {x_1, x_2, ... x_N}]\). For a pair of data points \(\mathbf {x_i}\) and \(\mathbf {x_j}\), the class similarity \(\mathcal {S}(i,j)\) is proposed in (6) and is used to define a new distance measure \(\varDelta '\),

$$\begin{aligned} \mathcal {S}(i,j)= {\left\{ \begin{array}{ll} 1; &{} \mathbf {x_i}=\mathbf {x_j} \\ \mathbf {p(x_i)}^T\mathbf {p(x_j)}; &{} \mathbf {x_i}\ne \mathbf {x_j} \end{array}\right. } \end{aligned}$$
(6)
$$\begin{aligned} \varDelta '(\mathbf {x_i},\mathbf {x_j})&= \left\| \mathbf {x_i}-\mathbf {x_j}\right\| + \alpha \max (\varDelta ) (1-\mathcal {S}(i,j)) \end{aligned}$$
(7)

The new distance formula modifies the Euclidean distance based on the similarity value \(\mathcal {S}(i,j)\) of two data points. The similarity of a data point with itself is defined as 1 so that its distance to itself remains 0. For two distinct data points \(\mathbf {x_i}\) and \(\mathbf {x_j}\), the similarity is defined as the inner product of their class probability vectors. If \(\mathbf {x_i}\) and \(\mathbf {x_j}\) belong to different classes, this inner product is expected to be small, giving a smaller similarity \(\mathcal {S}(i,j)\) and hence a larger \(\varDelta '(\mathbf {x_i},\mathbf {x_j})\) compared to the Euclidean distance.
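Eqs. (6) and (7) translate directly into a short computation; the following sketch (with illustrative names) builds the full \(N\times N\) modified distance matrix \(\varDelta '\) from the class probability vectors, where \(\alpha \) is the tuning parameter of Eq. (7) and \(\max (\varDelta )\) is the largest pairwise Euclidean distance.

```python
# Class similarity of Eq. (6) and modified distance of Eq. (7).
import numpy as np
from scipy.spatial.distance import cdist

def class_similarity_distance(X, P, alpha):
    """X: (D, N) data; P: (C, N) class probability vectors; alpha: tuning parameter.
    Returns the (N, N) modified distance matrix Delta'."""
    delta = cdist(X.T, X.T)            # pairwise Euclidean distances Delta
    S = P.T @ P                        # inner products p(x_i)^T p(x_j)
    np.fill_diagonal(S, 1.0)           # S(i, i) = 1, so Delta'(x_i, x_i) = 0
    return delta + alpha * delta.max() * (1.0 - S)
```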

Based on this new distance \(\varDelta '(\mathbf {x_i},\mathbf {x_j})\), the neighbors of each data point \(\mathbf {x_i}\) are selected, which incorporates class information as well as the similarity among neighbors. The rest of the procedure for finding the subspace is the same as in ONPP. Table 1 gives the algorithm for computing the Class Similarity based ONPP subspace.

Table 1. Class-Similarity based ONPP Algorithm
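To summarize the procedure of Table 1 in code, the following sketch stitches together the helper functions from the earlier sketches into one possible CS-ONPP pipeline. The default parameter values are illustrative rather than the values tuned in Sect. 4, and the PCA preprocessing of \(\mathbf {X}\) before the weight and basis computation is omitted here for brevity.

```python
# One possible end-to-end CS-ONPP pipeline, reusing class_probabilities,
# class_similarity_distance, reconstruction_weights and onpp_bases from the
# sketches above; parameter defaults are illustrative.
import numpy as np

def cs_onpp(X, labels, k=5, d=30, d_pca=10, alpha=0.5):
    """X: (D, N) training data; labels: (N,). Returns bases V and embedding Y."""
    P = class_probabilities(X, labels, d_pca)            # Eqs. (3)-(5)
    delta_mod = class_similarity_distance(X, P, alpha)   # Eqs. (6)-(7)
    np.fill_diagonal(delta_mod, np.inf)                  # exclude self-neighbors
    neighbors = np.argsort(delta_mod, axis=1)[:, :k]     # CS based neighborhood
    W = reconstruction_weights(X, neighbors)             # Step 2 of ONPP
    V = onpp_bases(X, W, d)                              # Step 3 of ONPP
    return V, V.T @ X                                    # test samples: y = V.T @ x
```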

4 Experiments and Results

The class similarity based neighborhood selection approach is applied to ONPP and MONPP, henceforth denoted CS-ONPP and CS-MONPP. The recognition and dimensionality reduction performance of conventional ONPP and Modified ONPP are compared with those of CS-ONPP and CS-MONPP on benchmark face databases and handwritten numerals databases.

The face databases used are ORL [13], UMIST [3] and CMU-PIE [14], with nearly 400, 564 and 1596 images respectively, showing variations in pose, lighting, occlusion and expression. For uniformity, all images are resized to \(40\times 40\), and 50% of the images are used for training. The handwritten numerals databases used are MNIST [6], Gujarati [9] and Devanagari [2], with nearly 68000, 13000 and 18000 images respectively, showing large variations in stroke width, orientation, shape, etc. All images were resized to \(30 \times 30\). For each database, 1000 randomly selected images were used for training.

To analyze the behavior of the proposed method with respect to the PCA dimension \(d_{PCA}\), the tuning parameter \(\alpha \) and the ONPP subspace dimension d, experiments are repeated for various settings of (\(d_{PCA}\), \(\alpha \), d), with \(d_{PCA}\in \{2,4,6,8,10\}\), \(\alpha \in \{0.25, 0.50, 0.75\}\) and ONPP subspace dimensions taken from \(\{5,10,\dots \}\). To obtain unbiased results, 20 such randomizations were performed for every setting of (\(d_{PCA}\), \(\alpha \), d) on all databases. The best recognition accuracy (in %) achieved with conventional ONPP and MONPP (column 3) and the corresponding subspace dimensions (column 4) are reported in Table 2, together with the subspace dimensions, PCA dimension and \(\alpha \) (columns 5, 6, 7) required by the CS based approaches to reach the same recognition accuracy.

Table 2. Best Recognition Accuracy (%) achieved using ONPP and MONPP with the corresponding subspace dimensions d. To achieve the same recognition accuracy, the subspace dimensions required by CS-ONPP and CS-MONPP are reported with the corresponding tuning parameter \(\alpha \) and PCA dimension \(d_{PCA}\)
Table 3. Best Recognition Accuracy (%) of the proposed CS-ONPP and CS-MONPP with the corresponding subspace dimension d, tuning parameter \(\alpha \) and PCA dimension \(d_{PCA}\)

For all databases, it is observed that the class similarity based approaches achieve better recognition with fewer subspace dimensions than the conventional approaches. For ORL, CS-ONPP and CS-MONPP need on average 55 and 62 fewer subspace dimensions respectively. For UMIST, the proposed methods need on average 100 and 85 fewer dimensions. For CMU-PIE, CS-ONPP improves the dimensionality reduction only by a small margin, but CS-MONPP needs about 700 fewer dimensions to achieve the best recognition. For MNIST, the best recognition can be achieved with on average 45 fewer dimensions using both CS-ONPP and CS-MONPP. For the Devanagari data, CS-ONPP needs on average 20 fewer dimensions to achieve the best recognition, whereas CS-MONPP needs on average 15 fewer. For Gujarati, to reach the best recognition of ONPP, CS-ONPP needs on average 30 fewer dimensions, whereas CS-MONPP needs on average 27 fewer. It is also observed that the proposed neighborhood rule increases the overall recognition accuracy. Table 3 reports the best recognition accuracy of the proposed CS-ONPP and CS-MONPP along with the subspace dimension (d), tuning parameter (\(\alpha \)) and PCA dimension (\(d_{PCA}\)).

5 Conclusion

Conventional ONPP selects neighbors based on the Euclidean distance or the class knowledge alone, which may not be the best rule when the data distribution is highly overlapping. We propose a new probability based neighborhood selection rule that incorporates both sources of information, the Euclidean distance and the class similarity between two data points, with the class probabilities computed using Logistic Regression. In experiments performed on face data and handwritten numerals data, the class similarity based approaches CS-ONPP and CS-MONPP outperform the conventional algorithms in recognition and achieve superior recognition rates with comparatively fewer subspace dimensions. In future work, it will be interesting to observe whether this neighborhood rule improves the performance of the class of DR methods that are based on local neighborhoods.