
1 Introduction

Image categorization has been a hot topic in the computer vision and pattern recognition community. Researchers have made significant progress in this domain by deploying semi-supervised learning (SSL) paradigms [2, 7, 8, 15].

Unlike unsupervised and supervised learning, SSL exploits both labeled and unlabeled data samples when estimating models. However, SSL may face difficulties, especially in cases where labels are very scarce. Therefore, one interesting approach is to increase the size of the labeled data by invoking active learning paradigms (e.g., [1, 11, 18]). One main goal of active learning is to generate more labeled samples by predicting the labels of unlabeled samples and exploiting them to build new models and classifiers. The main problems that these paradigms solve are: (i) identifying the most relevant unlabeled samples whose labels the system should predict first, and (ii) retaining only confident predictions. Usually, the proposed solutions rely on the concept of confidence in prediction and classification. For instance, if the confidence of a label prediction is insufficient for a specific data sample (i.e., the predicted label has high uncertainty), then the corresponding sample will not be exploited by the final model; at most, it will be used as an unlabeled sample since its estimated label is uncertain. In addition to the uncertainty and confidence concepts, some methods proposed other criteria. In order to avoid having many labeled samples in the same cluster, Nguyen et al. [21] exploit the diversity concept by deploying a pre-clustering method. In [6], the authors proposed an active cluster-based sampling method; however, since this approach employs a hierarchical clustering of unlabeled samples, the final performance can be affected by the performance of the clustering process itself. In [17], the authors introduced an active probabilistic variant of the K-NN classifier that can be used for multi-class problems. In [14], the authors proposed an approach based on the informativeness and representativeness of unlabeled samples.

Besides active learning paradigms, sparse representation has brought significant advances to the pattern recognition field [5], owing to its capacity to acquire, represent, and compress knowledge of the domain, and thus to reconstruct the data with minimal loss [26]. The Sparse Representation based Classifier (SRC) [27] can be thought of as a generalization of the Nearest Neighbor (NN) classifier and the Nearest Feature Subspace (NFS) [16]. Unlike the NN and NFS classifiers, SRC can be more robust in the presence of deviations and occlusions [25]. Although SRC performs well, it has a high computational cost since it relies on \(\ell _1\) minimization in the coding process; SRC is therefore not practical for real-world problems requiring fast decision and classification. Thus, many researchers exploited data locality [9]. For instance, the work of [19] limited the sparse coding dictionary to the nearest neighbors only. Xu et al. [28] proposed a Two Phase Test Sample Sparse Representation (TPTSSR) approach in which the regularization is given by the \(\ell _2\) norm. This method has two phases. In the first phase, the testing sample is represented as a linear combination of all training samples; the first M samples that provide its best representation are then chosen to be the atoms of a new, compact dictionary. In the second phase, the testing sample is coded using the new dictionary of M samples, and the label of the test sample is decided based on this representation. The Collaborative Representation Classifier (CRC) is the classifier that uses only the first phase of TPTSSR.

This paper is organized as follows: Sect. 2 presents our Active Two Phase Collaborative Representation classifier (ATPCRC). The experimental results and method comparisons are presented in Sect. 3. Section 4 concludes the paper. In the paper, capital bold letters denote matrices and bold letters denote vectors.

2 Active Two Phase Collaborative Representation Classifier

Proposed Method. In this section, we propose the Active Two Phase Collaborative Representation Classifier (ATPCRC). While our proposed ATPCRC makes the TPTSSR classifier active, it is also able to make any collaborative representation-based classifier (e.g., CRC and SRC) active. Our proposed ATPCRC aims to construct a classifier that exploits both labeled and unlabeled samples. Let \(\mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_L\) denote the labeled data samples and \(\mathbf{x}_{L+1}, \mathbf{x}_{L+2}, \ldots , \mathbf{x}_N\) denote the unlabeled data samples. The matrix of labeled samples is denoted by \(\mathbf{X}_l = [\mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_L] \in \mathbb {R}^{D \times L}\) and the matrix of unlabeled samples is denoted by \(\mathbf{X}_u = [\mathbf{x}_{L+1}, \mathbf{x}_{L+2}, \ldots , \mathbf{x}_N] \in \mathbb {R}^{D \times U}\) where L and \(U = N - L\) are the numbers of labeled and unlabeled samples, respectively. The training data are defined by the matrix \(\mathbf{X}= [\mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_N] \in \mathbb {R}^{D \times N}\).

Using active learning strategies, we first estimate the labels of all unlabeled samples, \(\mathbf{X}_u\). We then use both the originally labeled data and the samples with predicted labels, i.e., \(\mathbf{X}\), to build a new classifier. We recall that TPTSSR is a lazy classifier in the sense that all of its computation stages run at testing time. Any classifier can be invoked to predict the labels of the unlabeled samples; in our work, we employ the TPTSSR classifier with the original set of labeled samples. Once this stage is achieved, every sample in the training data matrix \(\mathbf{X}\) has either an original label or a predicted one. To classify a testing sample with the proposed ATPCRC, we proceed as follows. Two coding schemes are carried out independently, each with two phases of coding as in TPTSSR. The first coding scheme uses the labeled data \(\mathbf{X}_l\); the second uses the whole training data matrix \(\mathbf{X}\). To infer the class of a testing sample, a fusion of the class-wise reconstruction errors is exploited. Let \(M_l\) and M denote the parameters of the two coding processes.

First Phase. In the first phase, the testing sample \(\mathbf{y}\in \mathbb {R}^{D}\) will have two representations or codes: the first code vector is computed from a linear combination of the labeled samples \(\mathbf{X}_l\), and the second code results from a linear combination of the whole training data \(\mathbf{X}\). These two codes of \(\mathbf{y}\) are given by:

$$\begin{aligned} \mathbf{y}= a^l_1 \, \mathbf{x}_1 \, + \, a^l_2 \, \mathbf{x}_2 \, + \ldots + \, a^l_L \, \mathbf{x}_L \end{aligned}$$
(1)
$$\begin{aligned} \mathbf{y}= a_1 \, \mathbf{x}_1 \, + \, a_2 \, \mathbf{x}_2 \, + \ldots + \, a_L \, \mathbf{x}_L + \ldots + \, a_N \, \mathbf{x}_N \end{aligned}$$
(2)

Equations (1) and (2) can be rewritten in matrix form as follows:

$$ \mathbf{y}= \mathbf{X}_l \, \mathbf{a}^l \qquad \text{ and } \qquad \mathbf{y}= \mathbf{X}\, \mathbf{a} $$

where \(\mathbf{a}^l = [ a^l_1, a^l_2, \ldots , a^l_L]^T\) and \(\mathbf{a}= [ a_1, a_2, \ldots , a_N]^T\). The unknown code vectors \(\mathbf{a}^l\) and \(\mathbf{a}\) are recovered using \(\ell _2\) regularization. These two vectors are solutions to the following optimization problems, respectively:

$$\begin{aligned} \mathbf{a}^{l\star } &= \arg \min _{\mathbf{a}^l} \Vert \mathbf{y}- \mathbf{X}_l \, \mathbf{a}^l \Vert ^2 + \lambda _l \, \Vert \mathbf{a}^l \Vert ^2 \\ \mathbf{a}^{\star } &= \arg \min _{\mathbf{a}} \Vert \mathbf{y}- \mathbf{X}\, \mathbf{a}\Vert ^2 + \lambda \, \Vert \mathbf{a}\Vert ^2 \end{aligned}$$

where \(\lambda _l\) and \(\lambda \) are two regularization parameters. The solutions for \(\mathbf{a}^l\) and \(\mathbf{a}\) are provided by:

$$\begin{aligned} \mathbf{a}^{l\star } &= (\mathbf{X}_l^{T} \,\mathbf{X}_l +\lambda _l \, \mathbf{I}_l)^{-1} \, \mathbf{X}_l^{T} \, \mathbf{y} \\ \mathbf{a}^{\star } &= (\mathbf{X}^{T} \,\mathbf{X}+\lambda \, \mathbf{I})^{-1} \, \mathbf{X}^{T} \, \mathbf{y} \end{aligned}$$
(3)

where \(\mathbf{I}\) and \(\mathbf{I}_l\) are identity matrices of appropriate size. From Eqs. (1) and (2), one can see that each data sample \(\mathbf{x}_i\) has its own contribution to the reconstruction of the test sample \(\mathbf{y}\): from Eq. (1) the contribution of \(\mathbf{x}_i\) is \(a^l_i \mathbf{x}_i\), and from Eq. (2) it is \(a_i \mathbf{x}_i\). Therefore, \(\mathbf{x}_i\) has a large contribution in Eq. (1) if \(\Vert \mathbf{y}- a^l_i \, \mathbf{x}_i \Vert ^2\) is small, and a large contribution in Eq. (2) if \(\Vert \mathbf{y}- a_i \, \mathbf{x}_i \Vert ^2\) is small. Thus, the \(M_l\) samples (\(1 \le M_l \le L\)) with the largest contributions when approximating \(\mathbf{y}\) in Eq. (1) and the M samples (\(1 \le M \le N\)) with the largest contributions when approximating \(\mathbf{y}\) in Eq. (2) are handed over to the second phase of coding. The two subsets of selected samples are denoted by \(\{ \widetilde{\mathbf{x}}^l_1, \widetilde{\mathbf{x}}^l_2, \ldots , \widetilde{\mathbf{x}}^l_{M_l} \}\) and \( \{ \widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \ldots , \widetilde{\mathbf{x}}_{M} \}\). In matrix form, these two dictionaries are given by \(\widetilde{\mathbf{X}_l} = [ \widetilde{\mathbf{x}}^l_1, \widetilde{\mathbf{x}}^l_2, \ldots , \widetilde{\mathbf{x}}^l_{M_l}]\) and \(\widetilde{\mathbf{X}} = [ \widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \ldots , \widetilde{\mathbf{x}}_{M}]\).
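To make the first phase concrete, here is a minimal NumPy sketch of the ridge coding of Eq. (3) and the contribution-based selection. The helper names ridge_code and select_top_contributors are ours, not from the paper, and the columns of X are assumed to hold the samples.

```python
import numpy as np

def ridge_code(X, y, lam):
    """Closed-form ridge code a* = (X^T X + lam*I)^{-1} X^T y, as in Eq. (3)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def select_top_contributors(X, y, a, M):
    """Keep the M columns whose individual approximations a_i * x_i leave
    the smallest residuals ||y - a_i x_i||^2, i.e. the largest contributions."""
    residuals = np.array([np.linalg.norm(y - a[i] * X[:, i]) ** 2
                          for i in range(X.shape[1])])
    idx = np.argsort(residuals)[:M]   # indices of the M smallest residuals
    return X[:, idx], idx
```

Returning idx as well lets the caller recover the (original or predicted) labels of the selected samples, which the second phase needs.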

Second Phase. In the second phase, the testing sample \(\mathbf{y}\) is represented by two code vectors: the first one is a linear combination of the selected \(M_l\) labeled samples and the second one is a linear combination of the selected M training samples. This can be written as:

$$ \mathbf{y}= \widetilde{\mathbf{X}_l} \, \mathbf{b}^l \qquad \text{ and } \qquad \mathbf{y}= \widetilde{\mathbf{X}} \, \mathbf{b} $$

where \(\mathbf{b}^l\) and \(\mathbf{b}\) denote the second-phase code vectors. Similarly to the first phase (cf. Eq. (3)), the unknown vectors \(\mathbf{b}^l\) and \(\mathbf{b}\) are provided by:

$$\begin{aligned} \mathbf{b}^{l\star } &= (\widetilde{\mathbf{X}_l}^{T} \, \widetilde{\mathbf{X}_l} + \gamma _l\, \mathbf{I}_l)^{-1} \, \widetilde{\mathbf{X}_l}^{T} \, \mathbf{y} \\ \mathbf{b}^{\star } &= (\widetilde{\mathbf{X}}^{T} \, \widetilde{\mathbf{X}} + \gamma \, \mathbf{I})^{-1} \, \widetilde{\mathbf{X}}^{T} \, \mathbf{y} \end{aligned}$$
(4)

where \(\gamma \) and \(\gamma _l\) are two regularization parameters.
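Assuming the ridge_code helper sketched earlier is in scope, the second-phase codes are obtained by the same closed form applied to the two compact dictionaries:

```python
# Second-phase codes of Eq. (4) on the compact dictionaries;
# Xl_tilde (D x M_l) and X_tilde (D x M) come from the first phase,
# gamma_l and gamma are the regularization parameters of Eq. (4).
b_l = ridge_code(Xl_tilde, y, gamma_l)
b   = ridge_code(X_tilde,  y, gamma)
```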

Suppose that, among the \(M_l\) selected labeled samples, \(t_l\) samples belong to the \(c^{th}\) class: \((\widetilde{\mathbf{x}}^l_1)^c, (\widetilde{\mathbf{x}}^l_2)^c, \ldots , (\widetilde{\mathbf{x}}^l_{t_l})^c\), with corresponding coefficients \((b^l_1)^{c}, (b^l_2)^{c}, \ldots , (b^l_{t_l})^{c}\). Suppose that, among the M selected training samples, t samples belong to the \(c^{th}\) class (or are predicted to be in this class): \((\widetilde{\mathbf{x}}_1)^c, (\widetilde{\mathbf{x}}_2)^c, \ldots , (\widetilde{\mathbf{x}}_{t})^c\), with corresponding coefficients \((b_1)^{c}, (b_2)^{c}, \ldots , (b_{t})^{c}\). We can define the reconstruction error associated with class c, Dev(c), by:

$$\begin{aligned} Dev(c) = \eta \, \bigg \Vert \mathbf{y}- \sum _{j=1}^{t_l} \, (\widetilde{\mathbf{x}}^l_j)^c \, (b^l_j)^c \bigg \Vert ^2 + (1-\eta ) \, \bigg \Vert \mathbf{y}- \sum _{j=1}^{t} \, (\widetilde{\mathbf{x}}_j)^c \, (b_j)^{c} \bigg \Vert ^2 \end{aligned}$$
(5)

where \(\eta \) is a balance parameter (\(0 \le \eta \le 1\)). The proposed residual fuses the collaborative contributions of the selected samples of the \(c^{th}\) class in representing the testing sample \(\mathbf{y}\) by both \(\mathbf{X}_l\) and \(\mathbf{X}\). A large contribution corresponds to a small residual. Therefore, the label of \(\mathbf{y}\) is estimated by:

$$ l(\mathbf{y}) = \arg \min _{c} \, Dev (c), \qquad 1 \le c \le C $$

where C is the number of classes and \(l(\mathbf{y})\) is the predicted class label of the testing sample \(\mathbf{y}\). This is the output of the ATPCRC. By using this merging rule, we are able to down-weight the residuals associated with the samples in \(\mathbf{X}\), since their labels are not all correct. The introduced class-wise reconstruction errors avoid the use of an ad-hoc sample-based confidence measure. From Eq. (5), we can observe that if \(\eta \) is set to 1, we recover the classic TPTSSR, and if \(\eta \) is set to zero, we get a trivial active variant of TPTSSR. In the sequel, we will show that the proposed ATPCRC can outperform both the classic TPTSSR and the trivial active variant of TPTSSR.
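A sketch of the fused residual of Eq. (5) and the decision rule follows; labels_l and labels (our names, not from the paper) hold the class labels of the columns of the two compact dictionaries, the latter mixing ground-truth and predicted labels.

```python
import numpy as np

def atpcrc_decision(y, Xl_tilde, b_l, labels_l, X_tilde, b, labels,
                    n_classes, eta=0.8):
    """Return arg min_c Dev(c), with Dev(c) the fused class-wise residual."""
    dev = np.empty(n_classes)
    for c in range(n_classes):
        sel_l = (labels_l == c)                 # class-c atoms, labeled dict.
        r_l = np.linalg.norm(y - Xl_tilde[:, sel_l] @ b_l[sel_l]) ** 2
        sel = (labels == c)                     # class-c atoms, full dict.
        r = np.linalg.norm(y - X_tilde[:, sel] @ b[sel]) ** 2
        dev[c] = eta * r_l + (1.0 - eta) * r    # Eq. (5)
    return int(np.argmin(dev))
```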

The Algorithm. The introduced ATPCRC has the following inputs: the labeled data matrix \(\mathbf{X}_l = [\mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_L] \in \mathbb {R}^{D \times L}\), the training data matrix \(\mathbf{X}= [\mathbf{x}_1, \mathbf{x}_2, \ldots , \mathbf{x}_N] \in \mathbb {R}^{D \times N}\) (containing both labeled and unlabeled samples), the testing sample \(\mathbf{y}\in \mathbb {R}^D\), and the parameters M and \(M_l\).

  1. Estimate the labels of the samples \(\mathbf{x}_{L+1}, \mathbf{x}_{L+2}, \ldots , \mathbf{x}_N\) using the TPTSSR classifier and the labeled data \(\mathbf{X}_l\); M is the TPTSSR parameter.

  2. Calculate the code vectors \(\mathbf{a}^{\star }\) and \(\mathbf{a}^{l\star }\) using Eq. (3).

  3. Compute the vector \(\mathbf{e}=(e_1,e_2, \ldots , e_N)^T\) where \(e_i = \Vert \mathbf{y}- a_i \, \mathbf{x}_i \Vert ^2\). Sort \(\mathbf{e}\) and choose the samples corresponding to its smallest M elements. These selected samples are denoted \(\widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \ldots , \widetilde{\mathbf{x}}_{M}\). Finally, form the matrix \(\widetilde{\mathbf{X}} = [\widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \ldots , \widetilde{\mathbf{x}}_{M}]\).

  4. Similarly, form the matrix \(\widetilde{\mathbf{X}_l} = [\widetilde{\mathbf{x}}^l_1, \widetilde{\mathbf{x}}^l_2, \ldots , \widetilde{\mathbf{x}}^l_{M_l}]\) using \(e^l_i = \Vert \mathbf{y}- a^l_i \, \mathbf{x}_i \Vert ^2\) instead of \(e_i\) and \(M_l\) instead of M.

  5. Compute the code vectors \(\mathbf{b}^{\star }\) and \(\mathbf{b}^{l\star }\) using Eq. (4).

  6. For every class c (\(1 \le c \le C\)), calculate the global residual Dev(c) defined in Eq. (5).

  7. Assign to \(\mathbf{y}\) the label of the class with the smallest residual.
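Putting the seven steps together, here is a compact end-to-end sketch of the test-time procedure, reusing the ridge_code, select_top_contributors, and atpcrc_decision helpers sketched above; estimate_labels_tptssr is an assumed placeholder for step 1, not an implementation from the paper.

```python
import numpy as np

def atpcrc_classify(y, X_l, labels_l, X_u, M_l, M, n_classes,
                    lam_l=0.01, lam=0.01, gam_l=0.01, gam=0.01, eta=0.8):
    """One test sample y through the seven ATPCRC steps (a sketch)."""
    # Step 1: predict labels of the unlabeled pool with TPTSSR trained on
    # (X_l, labels_l); estimate_labels_tptssr is an assumed placeholder.
    labels_u = estimate_labels_tptssr(X_u, X_l, labels_l, M)
    X = np.hstack([X_l, X_u])
    labels_all = np.concatenate([labels_l, labels_u])
    # Step 2: first-phase ridge codes (Eq. (3))
    a_l = ridge_code(X_l, y, lam_l)
    a = ridge_code(X, y, lam)
    # Steps 3-4: select the compact dictionaries by contribution
    X_tilde, idx = select_top_contributors(X, y, a, M)
    Xl_tilde, idx_l = select_top_contributors(X_l, y, a_l, M_l)
    # Step 5: second-phase ridge codes (Eq. (4))
    b = ridge_code(X_tilde, y, gam)
    b_l = ridge_code(Xl_tilde, y, gam_l)
    # Steps 6-7: fused class-wise residuals (Eq. (5)) and arg-min decision
    return atpcrc_decision(y, Xl_tilde, b_l, labels_l[idx_l],
                           X_tilde, b, labels_all[idx], n_classes, eta)
```

In practice, step 1 depends only on the unlabeled pool, so it should be executed once before classifying any test samples rather than per test sample as written here.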

3 Performance Study

In this section, we compare the performance of the proposed ATPCRC with that of twelve methods: the Nearest Neighbor classifier (NN), Support Vector Machines (SVM) with a polynomial kernel, the Sparse Representation based Classifier (SRC) [27], the Two Phase Test Sample Sparse Representation classifier (TPTSSR) [28], Semi-supervised Discriminant Embedding (SDE) [13], Semi-supervised Discriminant Analysis (SDA) [4], Transductive Component Analysis (TCA) [20], Sparsity Preserving Discriminant Analysis (SPDA) [23], Laplacian Regularized Least Squares (LapRLS) [3], Flexible Manifold Embedding (FME) [22], Kernel Flexible Manifold Embedding (KFME) [12], and Semi-supervised Exponential Discriminant Embedding (ESDE) [10]. The SVM, NN, SRC, and TPTSSR classifiers are supervised methods, while the other competing approaches exploit both labeled and unlabeled samples.

Experimental Setup. The experiments are run on four public image datasets belonging to several categories: one object database (COIL20), one handwritten digit database (USPS), and two face datasets (Extended Yale and Honda).

COIL20: The Columbia Object Image Library (COIL20) contains 1440 images: 20 objects, each represented by 72 images taken at pose intervals of five degrees. In our experiments, we use a subset of 18 images per object (one image for every \(20^{\circ }\) of rotation).

Extended Yale: There are 1774 images depicting 28 persons; each person has 59–64 frontal images.

Honda: We use 1138 face images retrieved from the public Honda Video DataBase (HVDB). These images correspond to 22 persons.

USPS Handwritten Digits: This dataset consists of 11000 images of handwritten digits from “0” to “9” (1100 images per digit). We use one tenth of this database.

Each dataset is randomly split into labeled, unlabeled, and testing samples. In the conducted experiments, we adopt three different partitions of the data, illustrated in Table 1. The labeled and unlabeled parts are used by the methods that exploit both labeled and unlabeled data. The test part is used to evaluate performance.

For each partition, the splitting process is repeated ten times. As a preprocessing step, PCA is applied to every dataset to reduce the dimensionality; we use a PCA that preserves 98% of the variability.
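For illustration, this dimensionality reduction step can be reproduced with scikit-learn, which can select the number of components reaching a target explained-variance ratio; the paper does not specify the PCA implementation it used, so this is only a sketch, and X_raw is our name for the raw data matrix.

```python
from sklearn.decomposition import PCA

# Keep the smallest number of principal components explaining 98% of the
# variance; X_raw is an (n_samples x n_features) data matrix.
pca = PCA(n_components=0.98, svd_solver='full')
X_reduced = pca.fit_transform(X_raw)
print(pca.n_components_)  # number of dimensions actually retained
```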

Table 1. Data partitions for the four image datasets.

Method Comparison. Table 2 depicts the recognition performance of the proposed ATPCRC and that of the 12 competing methods. In this table, we report the average recognition rate as well as its standard deviation over the ten random splits.

For the FME, KFME, SDE, SDA, SPDA, LapRLS, and TCA methods, all parameters are tuned over the set \(\{10^{-9}, 10^{-6}, 10^{-3}, 1, 10^{+3}, 10^{+6}, 10^{+9}\}\). For the ATPCRC method, the parameters M and \(M_l\) are chosen from \(\{30, 60, 90, \ldots , N\}\). The regularization parameters of the proposed ATPCRC method (i.e., \(\lambda \), \(\lambda _l\), \(\gamma \) and \(\gamma _l\)) are set to 0.01, and \(\eta \) is set to 0.8; this value of \(\eta \) was empirically found to be a good choice for all datasets.

For the projection methods (SDE, SDA, SPDA, TCA, and ESDE), the classification was performed using the nearest neighbor (NN) classifier. The reported results correspond to the best parameter configuration over the ten splits. Bold numbers correspond to the best recognition rates. Several observations can be made from Table 2. The main ones are as follows. (1) The performance of the introduced active classifier (ATPCRC) can be better than that of many of the competing methods. (2) The outperformance of the proposed ATPCRC method is significant for the Honda and Extended Yale datasets, which contain face images with high variability.

Table 3 compares our proposed ATPCRC with the trivial active TPTSSR, obtained by setting the \(\eta \) parameter of ATPCRC to zero. For the trivial active TPTSSR, the entire set of data samples \(\mathbf{X}\) is used: those with ground-truth labels and those with predicted ones. From this table, we can see that ATPCRC is superior to the trivial active TPTSSR in most cases. Thus, the use of weighted class-wise reconstruction residuals (i.e., Eq. (5)) is crucial for reaching good performance.

Table 2. Average and standard deviation of the correct classification rate (%) over ten random splits for several methods.
Table 3. Average recognition rate and standard deviation (%) of the trivial active TPTSSR and the proposed ATPCRC classifier.

Statistical Significance. In this section, we conduct a statistical analysis of the results. To this end, we use the well-known paired-sample t-test [24]. We adopt a confidence level of 95% (i.e., the statistical significance threshold p is set to 0.05). Table 2 shows the outcome of all paired-sample t-tests. For a given competing approach, an underlined rate indicates that there is no statistically significant difference between the proposed ATPCRC and this competing approach. Among the 144 paired tests, the proposed ATPCRC was significantly better in 134 configurations, representing 93.08% of the tested pairs.
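The test can be reproduced with SciPy on the ten per-split recognition rates of any two methods; the accuracy arrays below are hypothetical and serve only to illustrate the procedure.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical recognition rates (%) over the same ten random splits.
acc_atpcrc   = np.array([91.2, 90.5, 92.1, 89.8, 91.7,
                         90.9, 92.4, 91.0, 90.3, 91.5])
acc_baseline = np.array([88.1, 87.4, 89.0, 86.9, 88.6,
                         87.8, 89.3, 88.0, 87.2, 88.4])

t_stat, p_value = ttest_rel(acc_atpcrc, acc_baseline)
significant = p_value < 0.05  # 95% confidence level, as in the paper
```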

Computational Time. We measure the computational time needed by the TPTSSR, SRC, and proposed ATPCRC methods. We fix the number of labeled images to 50% of the whole data and use the remaining images as test images. Table 4 depicts the CPU time in seconds for classifying all test images. The experiments were run using MATLAB on a computer with an Intel Core i7-6900K CPU (8 cores, 3.6 GHz) and 128 GB of RAM. As can be seen, the proposed ATPCRC approach is much faster than the SRC method.

Table 4. CPU time (in seconds) of the SRC, TPTSSR and ATPCRC classifiers when 50% of the dataset are labeled images and the remaining 50% are test images.

4 Conclusion

In this paper, we introduced an Active Two Phase Collaborative Representation Classifier. Indeed, transforming the original TPTSSR (or any collaborative representation classifier) into an active classifier is a challenging task. The proposed fused class-wise reconstruction residual avoids adopting an ad-hoc sample-based confidence measure. Experiments conducted on four public image datasets show the superiority of the proposed method over 12 classification methods. These experiments demonstrate that active learning can lead to performance that is significantly better than that provided by the passive classifiers TPTSSR and SRC.