1 Introduction

Face recognition (FR) is one of the most popular and challenging topics in computer vision and pattern recognition. Over the past 20 years, researchers have developed a large number of face recognition methods [1,2,3,4,5,6,7,8,9,10,11,12,13]. Among these methods, sparse coding and discriminative methods have yielded significant results.

Naseem et al. [1] proposed the linear regression classifier (LRC) for face recognition. The main idea of LRC is to represent a testing face as a linear combination of the training samples of each class and to assign it to the class that represents it best. Wright et al. [2] imposed an \(l_1\)-norm regularization term on the linear representation model to avoid over-fitting and proposed the sparse representation based classification (SRC) framework for FR problems. In SRC, a testing image is coded as a sparse linear combination of the training samples via \(l_1\)-norm minimization. SRC then classifies the testing image by finding the class whose training samples, together with the corresponding coding coefficients, yield the smallest reconstruction error. Zhang et al. [3] showed that \(l_2\)-norm regularization of the coding coefficients can achieve results comparable to those of the \(l_1\)-norm and proposed the collaborative representation classifier (CRC) scheme. In all of the above models, the fidelity term is measured by the \(l_2\)-norm or the \(l_1\)-norm, which assumes that the error pixels independently follow a Gaussian or a Laplacian distribution, respectively. However, when the images contain illumination variation, occlusion, or disguise, this assumption may not hold.

Subsequently, several researchers enhanced the sparse coding based models and proposed new methods. To obtain more robustness, Yang et al. [4] proposed a robust sparse coding (RSC) model for FR, in which the residual between the testing image and its estimate is assumed to be independently and identically distributed according to some probability density function (PDF) whose parameter characterizes the distribution. RSC then seeks a maximum likelihood estimate of the sparse coding coefficients, which can be viewed as a weighted LASSO problem. He et al. [5] took advantage of the correntropy induced robust error metric and proposed the correntropy based sparse representation (CESR) model. Interestingly, RSC and CESR can both be viewed as M-estimators with different kernel sizes. More recently, He et al. [6] proposed a model that uses different half-quadratic functions to measure the error image, which combines the ideas of SRC, CESR and RSC. In addition, to make LRC more robust to random pixel corruption, occlusion, or illumination, Naseem et al. [7] extended LRC to the robust linear regression classification (RLRC) by making use of the Huber estimator. Zhou et al. [8] introduced the Markov random field (MRF) model into the sparse coding scheme and proposed sparse error correction with an MRF model. Jia et al. [9] incorporated a structured sparsity-inducing norm into the SRC model and presented a structured sparse representation classifier (SSRC).

To improve the recognition rate of sparse coding methods, we propose in this paper a deep cascade model based on sparse softmax vector coding (SSVD), inspired by [23]. The main contributions of our work are as follows. (1) The use of discriminative softmax vectors. SRC codes a testing image as a sparse linear combination of all training images and assigns it to the class with the minimum representation error. In other words, the representation errors of all classes carry class discrimination. Most existing sparse coding based methods focus only on the original or extracted image features. To further explore the effectiveness of sparse coding on the discriminative representation errors, we propose the SSVD method, in which the softmax vectors obtained from the representation errors are themselves sparsely represented, layer after layer. (2) A three-level spatial pyramid structure is used to enhance class discrimination. Most sparse coding methods operate on whole images and therefore ignore the local information of the subregions. Because the subregions show more detailed local information and can be more discriminative than the whole image, SSVD combines the whole image and its subregions to obtain softmax vectors by using a three-level spatial pyramid structure, as shown in the image coding part of Fig. 1. (3) A deep cascade model based on concatenated softmax vectors is proposed. As the cascade model goes deeper, the concatenated softmax vectors gain more class discrimination, which benefits classification. Our extensive experiments on benchmark databases show that the proposed deep model achieves better performance than many existing sparse coding methods.

Fig. 1. An example illustrating how the deep model works when classifying a test image \(\mathbf {y}\) under all training images \(\mathbf {X}\). (Color figure online)

The rest of this paper is organized as follows. Section 2 presents the proposed deep model. Section 3 presents the solving algorithm of sparse representation. The experiment results are shown in Sect. 4. Section 5 concludes this paper.

2 The Proposed Approach

In this section, we describe how a testing image is classified given all training images. In the first part, we define a procedure for obtaining a new feature. In the second part, we explain in detail how the deep cascade model shown in Fig. 1 proceeds.

2.1 Getting New Feature

Following SRC, suppose that we have C classes of subjects. Let \(\mathbf {d}\) denote a testing sample and \(\mathbf {D}=[{\mathbf {D}^1},{\mathbf {D}^2},\cdots ,{\mathbf {D}^C}]\) denote the dictionary, where \(\mathbf {D}^c\) collects the samples of class c. The representation model can be formulated as the following problem [15]:

$$\begin{aligned} \min _{\varvec{\alpha }}\parallel {\mathbf {d}-\mathbf {D}\varvec{\alpha }}\parallel _2^2+ \lambda \parallel {\varvec{\alpha }}\parallel _1 \end{aligned}$$
(1)

where \(\lambda \) is a scalar constant. After solving the above problem, we compute the representation error of each class as follows:

$$\begin{aligned} \varepsilon _c=\parallel {\mathbf {d}-{\mathbf {D}}^c{\varvec{\alpha }}^c}\parallel _2^2 \end{aligned}$$
(2)

where \(\mathbf {D}^c\) contains the samples of the \(c\)-th class and \(\varvec{\alpha }^c\) is the coefficient vector associated with the \(c\)-th class. The softmax vector \(\mathbf {r}\) is computed entrywise by the softmax function as follows:

$$\begin{aligned} r_c=\frac{e^{-\varepsilon _c}}{\sum _{k=1}^{C}{e^{-\varepsilon _k}}},\quad c=1,\cdots ,C \end{aligned}$$
(3)

where \(\mathbf {r}=[r_1,r_2,\cdots ,r_C]^T\in \mathbb {R}^C\). If the testing sample \(\mathbf {d}\) belongs to class \(i\) (\(i\le C\)), the representation error of class \(i\) should be the smallest, so \(r_i\) should be larger than the other entries of the softmax vector \(\mathbf {r}\); we call this property class discrimination. The above process of obtaining the softmax vector \(\mathbf {r}\) is named Getting New Feature on dictionary \(\mathbf {D}\) (\(GNF_D\)).
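The last two steps of the \(GNF_D\) procedure, the class-wise representation errors of Eq. (2) and the softmax vector of Eq. (3), can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `labels` array that assigns each dictionary column to a class, and the toy data are assumptions, and the coefficient vector `alpha` is taken as given (it would come from solving problem (1)).

```python
import numpy as np

def class_residuals(d, D, labels, alpha):
    """Squared representation error of d for each class, as in Eq. (2)."""
    classes = np.unique(labels)
    res = np.empty(len(classes))
    for k, c in enumerate(classes):
        mask = labels == c                      # columns of D belonging to class c
        res[k] = np.sum((d - D[:, mask] @ alpha[mask]) ** 2)
    return res

def softmax_vector(residuals):
    """Softmax of the negated residuals, as in Eq. (3)."""
    z = -residuals
    z = z - z.max()          # shift for numerical stability; the ratio is unchanged
    w = np.exp(z)
    return w / w.sum()

# Toy example (hypothetical data): two classes, one atom per class.
D = np.array([[1.0, 0.0],
              [0.0, 1.0]])
labels = np.array([0, 1])
d = np.array([1.0, 0.1])
alpha = np.array([1.0, 0.0])   # assumed to be the solution of problem (1)

r = softmax_vector(class_residuals(d, D, labels, alpha))
```

Because the exponent is the negated error, the class with the smallest representation error receives the largest entry of `r`, which is the class discrimination property described above.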

2.2 Sparse Softmax Vector Coding Based Deep Cascade Model

Without loss of generality, let \(\mathbf {X}\) represent the training images and \(\mathbf {Y}\) represent the testing images. The number of classes is C, and the numbers of training images and testing images are \(N_1\) and \( N_2\), respectively. For each image, a three-level spatial pyramid is used to compute the softmax vector. We take one testing image \(\mathbf {y}\) and all training images \(\mathbf {X}\) as an example to explain how to obtain the softmax vectors and classify the testing image \(\mathbf {y}\), as shown in Fig. 1.

Three parallel channels process the input images. In the first channel, the original testing image \(\mathbf {y}\) is represented by all training images \(\mathbf {X}\) and goes through the \(GNF_X\) procedure to obtain a softmax vector. Similarly, each training image goes through the \(GNF_X\) procedure, which yields a set of softmax vectors for the training images. In the second channel, every input image is divided equally into 4 subregions. Let \({\mathbf {y}}_i\) denote the \(i\)-th (\(i=1,\cdots ,4\)) subregion of the test image \(\mathbf {y}\) and \({\mathbf {X}}_i\) denote the set of \(i\)-th subregions of all the training images \(\mathbf {X}\). As in the first channel, each \({\mathbf {y}}_i\) goes through the \(GNF_{X_i}\) procedure, which generates 4 softmax vectors. These 4 softmax vectors are merged into one vector by max pooling or average pooling. Likewise, each subregion of every training image goes through the corresponding \(GNF_{X_i}\) procedure, and the resulting 4 softmax vectors are pooled into one vector. The pooled vectors of all images are then arranged side by side into one matrix, as shown in Fig. 2, which illustrates max pooling and average pooling for the part of the second channel marked by the red dashed box in Fig. 1. In the third channel, every input image is divided equally into 16 subregions. Following the same approach as in the second channel, we obtain a pooled softmax vector for the testing image \(\mathbf {y}\) and a set of pooled softmax vectors for the training images \(\mathbf {X}\).
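The subregion split and the pooling step of the second and third channels can be sketched as follows. The helper names are hypothetical, the subregions are taken in row-major order, and the sketch assumes that the image height and width are divisible by the grid size; the text does not say how other sizes are handled.

```python
import numpy as np

def split_subregions(img, grid):
    """Split a 2-D image into grid x grid equal blocks (row-major order)."""
    h, w = img.shape
    if h % grid or w % grid:
        raise ValueError("image size must be divisible by the grid size")
    bh, bw = h // grid, w // grid
    return [img[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            for i in range(grid) for j in range(grid)]

def pool_softmax(vectors, mode="max"):
    """Merge the softmax vectors of all subregions into one vector."""
    V = np.stack(vectors)                 # shape: (n_subregions, C)
    if mode == "max":
        return V.max(axis=0)              # per-class maximum
    if mode == "ave":
        return V.mean(axis=0)             # per-class average
    raise ValueError("mode must be 'max' or 'ave'")

# Toy 8 x 8 image: 4 subregions for channel 2, 16 subregions for channel 3.
img = np.arange(64, dtype=float).reshape(8, 8)
level2 = split_subregions(img, 2)
level3 = split_subregions(img, 4)

# Pooling two hypothetical 2-class softmax vectors.
pooled_max = pool_softmax([np.array([0.6, 0.4]), np.array([0.2, 0.8])], "max")
pooled_ave = pool_softmax([np.array([0.6, 0.4]), np.array([0.2, 0.8])], "ave")
```

Each subregion would be vectorized and passed through the \(GNF_{X_i}\) procedure before pooling. Note that a max-pooled vector generally no longer sums to one, whereas an average-pooled vector does.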

Fig. 2. Illustration of max pooling and average pooling. After each subregion of an image goes through the \(GNF_{X_i}\) procedure, 4 softmax vectors are obtained. We then compute the maximum or the average of the 4 values in each dimension to construct a new vector. Finally, the new vectors of all images are arranged side by side into one matrix.

After these 3 channels, the 3 softmax vectors of the testing image (tinted blue) are concatenated into one vector \({\mathbf {d}}_0\), and the 3 softmax vector sets of the training images (tinted red) are concatenated into one vector set \({\mathbf {D}}_0\). Then, \({\mathbf {d}}_0\) goes through the \(GNF_{D_0}\) procedure, and the resulting softmax vector is concatenated with \({\mathbf {d}}_0\) to construct the input sample \({\mathbf {d}}_1\) of the second layer. Similarly, each column of \({\mathbf {D}}_0\) goes through the \(GNF_{D_0}\) procedure, and the resulting softmax vector set is concatenated with \({\mathbf {D}}_0\) to construct the input dictionary \({\mathbf {D}}_1\) of the second layer. In the same way, we obtain the testing sample \({\mathbf {d}}_L\) and the dictionary \({\mathbf {D}}_L\) of layer L. Finally, \({\mathbf {d}}_L\) goes through \(GNF_{D_L}\) to obtain the final softmax vector, and the predicted class is the one with the maximum value in this vector.
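The layer-by-layer concatenation can be sketched as the loop below. To keep the example short and self-contained, the GNF step here uses a ridge-regularized least-squares code in place of the \(l_1\) problem (1); this substitution, the function names, the parameter `lam`, and the toy data are all assumptions made for illustration. Following the text literally, each training column is coded against the full dictionary, which includes the column itself.

```python
import numpy as np

def gnf_ridge(d, D, labels, lam=1e-2):
    """Stand-in GNF step: ridge coding instead of the l1 problem (assumption)."""
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ d)
    classes = np.unique(labels)
    res = np.array([np.sum((d - D[:, labels == c] @ alpha[labels == c]) ** 2)
                    for c in classes])
    w = np.exp(-(res - res.min()))        # softmax of negated errors, shifted
    return w / w.sum()

def cascade_predict(d0, D0, labels, n_layers=3, gnf=gnf_ridge):
    """Build (d_L, D_L) by repeated concatenation, then classify d_L."""
    d, D = d0, D0
    for _ in range(n_layers):
        r = gnf(d, D, labels)                                  # test softmax vector
        R = np.column_stack([gnf(D[:, j], D, labels)           # one per training column
                             for j in range(D.shape[1])])
        d = np.concatenate([d, r])                             # next-layer test sample
        D = np.vstack([D, R])                                  # next-layer dictionary
    final = gnf(d, D, labels)
    return np.unique(labels)[final.argmax()], d, D

# Toy data (hypothetical): 2 classes, 2 training samples each, 4 features.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
labels = np.array([0, 0, 1, 1])
D0 = (centers[labels] + 0.05 * rng.standard_normal((4, 4))).T   # columns are samples
d0 = centers[1] + 0.05 * rng.standard_normal(4)

pred, dL, DL = cascade_predict(d0, D0, labels, n_layers=3)
```

Each layer appends C entries to every sample, so after L layers the feature length grows from its initial value by L times C (here from 4 to 10 with C = 2 and L = 3).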

3 Solving Algorithm of Sparse Representation

In recent years, many algorithms have been proposed for sparse representation. In particular, the alternating direction method of multipliers (ADMM), first proposed in the 1970s [14], has drawn a lot of attention. Yang and Zhang [15] integrated proximal methods into ADMM for solving \(l_1\)-norm minimization problems.

In this paper, we also use ADMM to solve the sparse representation problem. In the SSVD model, sparse representation problems need to be solved at different stages. We take the first channel in Fig. 1 as an example to illustrate how we compute the sparse coding coefficients of the testing samples with respect to the dictionary \(\mathbf {X}\). Let \(\mathbf {X}=[\mathbf {x_1}, \mathbf {x_2}, \cdots , \mathbf {x_{N_1}}]\in \mathbb {R}^{d\times {N_1}}\) denote the training samples and \(\mathbf {Y}=[\mathbf {y_1}, \mathbf {y_2}, \cdots , \mathbf {y_{N_2}}]\in \mathbb {R}^{d\times {N_2}}\) denote the testing samples, where each column represents a sample. To learn the representation coefficients, a general sparse representation model is formulated as

$$\begin{aligned} \min _{\mathbf {W}}\parallel {\mathbf {Y-XW}}\parallel _F^2+\lambda \parallel {\mathbf {W}}\parallel _1 \end{aligned}$$
(4)

where \(\lambda \) is the regularization parameter that balances the two terms. Following ADMM, we introduce the auxiliary variable \(\mathbf {Z}\) with the constraint \(\mathbf {Z=W}\) and solve model (4) through its augmented Lagrangian function, which is formulated as

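$$\begin{aligned} L_{\mu }(\mathbf {W}, \mathbf {Z},\mathbf {\Lambda })=\parallel {\mathbf {Y-XW}}\parallel _F^2+\lambda \parallel {\mathbf {Z}}\parallel _1+<\mathbf {\Lambda },\mathbf {W-Z}>+\frac{\mu }{2}\parallel {\mathbf {W-Z}}\parallel _F^2 \end{aligned}$$
(5)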

where \(<\mathbf {P,Q}>=tr(\mathbf {P^TQ})\), \(\mathbf {\Lambda }\) is the Lagrange multiplier and \(\mu \) is a scalar penalty constant. The augmented Lagrangian is minimized along one coordinate direction at a time in each iteration. ADMM consists of the following iterations.

  (i) Given \(\mathbf {Z}=\mathbf {Z}^t, \mathbf {\Lambda }=\mathbf {\Lambda }^t\), update \(\mathbf {W}\) by

    $$\begin{aligned} {\mathbf {W}}^{t+1}=arg \min _{\mathbf {W}} L_{\mu }(\mathbf {W}, \mathbf {Z},\mathbf {\Lambda }) \end{aligned}$$
    (6)

  (ii) Given \(\mathbf {W}=\mathbf {W}^{t+1}, \mathbf {\Lambda }=\mathbf {\Lambda }^t\), update \(\mathbf {Z}\) by

    $$\begin{aligned} \mathbf {Z}^{t+1}=arg \min _{\mathbf {Z}} L_{\mu }(\mathbf {W}, \mathbf {Z},\mathbf {\Lambda }) \end{aligned}$$
    (7)

  (iii) Given \(\mathbf {W}=\mathbf {W}^{t+1}, \mathbf {Z}=\mathbf {Z}^{t+1}\), update \(\mathbf {\Lambda }\) by

    $$\begin{aligned} \mathbf {\Lambda }^{t+1}=\mathbf {\Lambda }^t+\mu (\mathbf {W}^{t+1}-\mathbf {Z}^{t+1}) \end{aligned}$$
    (8)

The key steps are to solve the optimization problems in Eqs. (6) and (7). Based on the augmented Lagrangian function in Eq. (5), Eq. (6) can be expressed as

$$\begin{aligned} \mathbf {W}^{t+1}=arg\min _{\mathbf {W}}(\parallel {\mathbf {Y-XW}}\parallel _F^2+<\mathbf {\Lambda },\mathbf {W-Z}>+\frac{\mu }{2}\parallel {\mathbf {W-Z}}\parallel _F^2) \end{aligned}$$
(9)

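Since Eq. (9) is a regularized least-squares problem, setting its gradient with respect to \(\mathbf {W}\) to zero, i.e. \(-2\mathbf {X}^T(\mathbf {Y}-\mathbf {XW})+\mathbf {\Lambda }^t+\mu (\mathbf {W}-\mathbf {Z}^t)=\mathbf {0}\), gives the closed-form solution

$$\begin{aligned} \mathbf {W}^{t+1}=(2\mathbf {X}^T\mathbf {X}+\mu \mathbf {I})^{-1}(2\mathbf {X}^T\mathbf {Y}- \mathbf {\Lambda }^t+\mu \mathbf {Z}^t ) \end{aligned}$$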
(10)

where \(\mathbf {I}\) is an identity matrix. Based on the augmented Lagrangian function in Eq. (5), Eq. (7) can be rewritten as

$$\begin{aligned} \mathbf {Z}^{t+1}=arg\min _{\mathbf {Z}}(\lambda \parallel {\mathbf {Z}}\parallel _1+<\mathbf {\Lambda },\mathbf {W-Z}> +\frac{\mu }{2}\parallel {\mathbf {W-Z}}\parallel _F^2) \end{aligned}$$
(11)

Because the \(l_1\)-norm is non-differentiable, the shrinkage technique [15] is used to solve this problem. The optimal solution is

$$\begin{aligned} \mathbf {Z}^{t+1}=shrinkage_{\frac{\lambda }{\mu }}(\mathbf {W}^{t+1}+\frac{\mathbf {\Lambda }^{t}}{\mu }) \end{aligned}$$
(12)
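For reference, the shrinkage operator is commonly defined as elementwise soft-thresholding, \(shrinkage_{\tau }(v)=sign(v)\max (|v|-\tau ,0)\). The sketch below assumes that this standard definition is the one intended in [15]; the function name and test values are illustrative.

```python
import numpy as np

def shrinkage(V, tau):
    """Elementwise soft-thresholding: sign(v) * max(|v| - tau, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

V = np.array([[-1.5, 0.2],
              [0.5, 3.0]])
out = shrinkage(V, 0.5)
# Entries with |v| <= 0.5 become 0; all others move 0.5 toward 0.
```

This operator is the exact minimizer of \(\tau |z|+\frac{1}{2}(z-v)^2\) in each entry, which is why the \(\mathbf {Z}\)-update has a closed form with threshold \(\lambda /\mu \).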

In ADMM, the iterations are normally repeated until certain optimality conditions and stopping criteria are satisfied, at which point the objective function value has converged. In this paper, to simplify the procedure, we instead set a maximum number of iterations. The detailed process for solving problem (4) is summarized in Algorithm 1.

Algorithm 1. Solving problem (4) by ADMM.
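The iteration of steps (i) to (iii) with a fixed iteration budget can be sketched in NumPy as follows. This is an illustrative reading of Eqs. (6) to (8), not a transcription of Algorithm 1: the function names, default parameter values, zero initialization, and toy data are assumptions. The \(\mathbf {W}\)-step solves the linear system obtained by setting the gradient of the objective in Eq. (9) to zero, \((2\mathbf {X}^T\mathbf {X}+\mu \mathbf {I})\mathbf {W}=2\mathbf {X}^T\mathbf {Y}-\mathbf {\Lambda }+\mu \mathbf {Z}\); the factor 2 comes from the unscaled squared Frobenius norm in problem (4).

```python
import numpy as np

def shrinkage(V, tau):
    """Elementwise soft-thresholding."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def admm_sparse_coding(X, Y, lam=1e-2, mu=1.0, max_iter=200):
    """ADMM sketch for min_W ||Y - XW||_F^2 + lam * ||W||_1 with split Z = W."""
    n = X.shape[1]
    W = np.zeros((n, Y.shape[1]))
    Z = np.zeros_like(W)
    Lam = np.zeros_like(W)
    A = 2.0 * X.T @ X + mu * np.eye(n)    # W-step system matrix, fixed over iterations
    XtY2 = 2.0 * X.T @ Y
    for _ in range(max_iter):
        W = np.linalg.solve(A, XtY2 - Lam + mu * Z)   # W-update (least squares)
        Z = shrinkage(W + Lam / mu, lam / mu)         # Z-update (soft-thresholding)
        Lam = Lam + mu * (W - Z)                      # multiplier update
    return Z                                          # Z carries the exact zeros

def objective(X, Y, W, lam):
    """Objective value of problem (4)."""
    return np.sum((Y - X @ W) ** 2) + lam * np.abs(W).sum()

# Toy problem (hypothetical data): 10 atoms, 3 test samples, one active atom each.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))
W_true = np.zeros((10, 3))
W_true[[0, 4, 7], [0, 1, 2]] = 1.0
Y = X @ W_true
W_hat = admm_sparse_coding(X, Y, lam=1e-2, mu=1.0, max_iter=200)
```

Since the system matrix does not change across iterations, it could also be factorized once (for example with a Cholesky factorization) and reused, which matters when the number of training samples is large.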

4 Experimental Results

In this section, we present the experimental results of our proposed SSVD method on publicly available databases, following the same experimental settings as in [16]. We randomly split each database into a training set and a testing set. To reduce the influence of any particular split, all experiments are run 10 times, and the average recognition rates are reported. Different from [16], we validate our proposed framework only on three face databases (Extended Yale B [17], CMU PIE [18], AR [19]) and one object database (COIL-100 [20]). We compare the proposed method with popular methods such as LLC, LRC, CRC, SRC and SVM [21], as well as the three methods (ENLR, DENLR, MENLR) proposed in [16]. Our (Max) and Our (Ave) denote the variants that obtain the final softmax vectors in the image coding part by max pooling and average pooling, respectively.

In the experiments, we either reshape each image into a vector or extract a random feature of the image. All samples are \(l_2\)-normalized. The experimental results show that our method achieves better results than many of the compared methods, especially on face databases. The bold numbers represent the best recognition rates. In the following experiments, \(\lambda _1\) and \(\mu _1\) denote the parameters in the image coding part, and \(\lambda _2\) and \(\mu _2\) denote the parameters in the softmax vector coding part in Fig. 1. The number of layers is set to 10 on all databases.
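The data preparation described above, \(l_2\)-normalization of every sample and a random per-class split into training and testing sets, can be sketched as follows. The helper names, the toy feature matrix, and the seed are hypothetical; in the reported protocol the split would be redrawn for each of the 10 runs and the recognition rates averaged.

```python
import numpy as np

def l2_normalize_columns(X, eps=1e-12):
    """Scale every column (sample) to unit l2 norm."""
    norms = np.linalg.norm(X, axis=0, keepdims=True)
    return X / np.maximum(norms, eps)     # eps guards against all-zero columns

def random_split_per_class(labels, n_train, rng):
    """Pick n_train random samples of each class for training, the rest for testing."""
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# Toy data (hypothetical): 3 classes with 10 samples each, 16-dimensional features.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
features = l2_normalize_columns(rng.standard_normal((16, labels.size)))

train_idx, test_idx = random_split_per_class(labels, 4, rng)
X_train, X_test = features[:, train_idx], features[:, test_idx]
```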

(1) Extended Yale B Database: The Extended Yale B database contains 2414 frontal face images of 38 individuals, each of whom has around 64 near-frontal images under different illuminations. We randomly select 15, 20, 25, or 30 images per person for training and use the rest for testing. We set \(\lambda _1=10^{-4}\), \(\mu _1=10^{-1}\), \(\lambda _2=10^{-4}\), and \(\mu _2=1.7\). The recognition rates of different methods on this database are summarized in Table 1. Note that mean recognition rates are reported, and the bold numbers represent the best recognition rates. Our method achieves the best recognition rates. In particular, when the number of training samples is 15, the recognition rate of our method is 4 percent higher than that of MENLR, which achieves the best result among the compared methods. This indicates that our method can achieve a good recognition rate on this database even when fewer training samples are available.

(2) CMU PIE Database: The CMU PIE face database contains 41,368 face images from 68 subjects as a whole. The images under five near frontal poses (C05, C07, C09, C27 and C29) are used in our experiment. We randomly select 15, 20, 25, 30 images from each subject as training samples and the remaining images as test samples. We set \(\lambda _1=10^{-4}\), \(\mu _1=10^{-1}\), \(\lambda _2=10^{-4}\), and \(\mu _2=10^{-2}\). The classification rates of different methods are summarized in Table 2. It is clear that our method outperforms the compared methods in different cases.

Table 1. Recognition rates (%) on Extended Yale B database with different number of training samples
Table 2. Recognition rates (%) on CMU PIE database with different number of training samples

(3) AR Database: The AR face database contains about 4,000 color face images of 126 subjects, which consist of frontal faces with different facial expressions, illuminations and disguises. In this experiment, we select a subset of 2600 images from 50 female and 50 male subjects. We randomly select 8, 11, 14, or 17 images of each subject as training samples and use the remaining images as test samples. Following the experiment in [22], each image and each of its subregions is projected onto a 540-dimensional feature vector with a randomly generated matrix drawn from a zero-mean normal distribution. We set \(\lambda _1=10^{-5}\), \(\mu _1=2\), \(\lambda _2=10^{-5}\), and \(\mu _2=10^{-3}\). The recognition rates of different methods on this database are summarized in Table 3. From the table, we can see that our method achieves the best recognition rates.

(4) COIL-100 Database: The Columbia Object Image Library (COIL-100) database contains various views of 100 objects (72 images per object) under different lighting conditions. In our experiments, the images are converted to gray-scale images of \(32 \times 32\) pixels. We randomly select 15, 20, 25, or 30 images per object to construct the training set, and the test set contains the rest of the images. We set \(\lambda _1=10^{-2}\), \(\mu _1=1\), \(\lambda _2=10^{-4}\), and \(\mu _2=10^{-2}\). The recognition rates of different methods on this database are summarized in Table 4. We can see that our method is inferior to MENLR, which achieves the best result, but still better than the other methods.

Table 3. Recognition rates (%) on AR database with different number of training samples
Table 4. Recognition rates (%) on COIL-100 database with different number of training samples

In summary, the proposed SSVD model achieves strong results on face databases. It is also worth noting that SSVD (Max) outperforms SSVD (Ave) on the Extended Yale B database and is inferior to SSVD (Ave) on the CMU PIE, AR and COIL-100 databases. An important advantage of the SSVD model is that each image is divided into 4 or 16 subregions, which means that one image is represented 4 or 16 additional times. This helps to correct misclassified images. We take an example to explain the effect of max pooling and average pooling. Suppose that an image set with four categories is split into a training set and a testing set. Given a misclassified testing image that actually belongs to class 1, suppose that its softmax vector is \(r=[0.25\quad 0.40\quad 0.15\quad 0.20]^T\). For its subregions, there are two cases. (1) One subregion (say, the first) is much more discriminative than the whole image and the other subregions. Let \(r_1=[0.60\quad 0.20\quad 0.10\quad 0.10]^T\), \( r_2=[0.30\quad 0.45\quad 0.10\quad 0.15]^T\), \(r_3=[0.25\quad 0.50\quad 0.10\quad 0.15]^T\), and \(r_4=[0.30\quad 0.35\quad 0.25\quad 0.10]^T\) denote the softmax vectors of the 4 subregions, respectively. After max pooling, we obtain the final softmax vector \(r'=[0.60\quad 0.50\quad 0.25\quad 0.15]^T\), whose largest entry is the first one, so the misclassification is corrected. (2) The above case is unusual in practice. More likely, most of the subregions are moderately more discriminative than the whole image, while the remaining ones are not. The misclassified image and its softmax vector are the same as in case (1). Let \(r_1=[0.35\quad 0.25\quad 0.15\quad 0.25]^T\), \(r_2=[0.40\quad 0.20\quad 0.30\quad 0.10]^T\), \(r_3=[0.20\quad 0.50\quad 0.10\quad 0.20]^T\), and \(r_4=[0.45\quad 0.15 \quad 0.20\quad 0.20]^T\) denote the softmax vectors of the 4 subregions, respectively. After average pooling, we obtain the final softmax vector \(r'=[0.35\quad 0.28\quad 0.19\quad 0.19]^T\) (rounded to two decimals), which also corrects the misclassification.
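The two cases can be checked numerically with the short sketch below, which uses exactly the softmax vectors given in the example; the variable names are illustrative, and class indices are printed 1-based to match the text.

```python
import numpy as np

# Case (1): one strongly discriminative subregion (rows are r_1 .. r_4).
case1 = np.array([[0.60, 0.20, 0.10, 0.10],
                  [0.30, 0.45, 0.10, 0.15],
                  [0.25, 0.50, 0.10, 0.15],
                  [0.30, 0.35, 0.25, 0.10]])

# Case (2): most subregions moderately favour class 1.
case2 = np.array([[0.35, 0.25, 0.15, 0.25],
                  [0.40, 0.20, 0.30, 0.10],
                  [0.20, 0.50, 0.10, 0.20],
                  [0.45, 0.15, 0.20, 0.20]])

max_pooled = case1.max(axis=0)     # per-class maximum over the 4 subregions
ave_pooled = case2.mean(axis=0)    # per-class average over the 4 subregions

print(max_pooled, "-> class", max_pooled.argmax() + 1)
print(ave_pooled.round(2), "-> class", ave_pooled.argmax() + 1)

# Swapping the pooling rules on the same data, for comparison.
ave_on_case1 = case1.mean(axis=0)
max_on_case2 = case2.max(axis=0)
```

On these particular numbers, swapping the pooling rules does not correct the error: averaging case (1) gives 0.3625 for class 1 against 0.375 for class 2, and max pooling case (2) gives 0.45 against 0.50. This is consistent with the observation that neither pooling rule is uniformly better across databases.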

Discussion of Layers: To better illustrate our method, we plot sample curves (Fig. 3) that show the recognition rates for different numbers of layers in the deep model on each database. As the number of layers increases, the recognition rate tends to rise, which demonstrates the effectiveness of the deep cascade model.

Fig. 3. The recognition rates with different numbers of layers on different databases: (a) Extended Yale B database, (b) CMU PIE database, (c) AR database, (d) COIL-100 database.

Convergence: To illustrate the effectiveness of our solving algorithm, we show how the objective function value of problem (4) varies with the iteration number (Fig. 4) when Algorithm 1 is applied on the Extended Yale B database. The objective function values show a clear convergence trend, which demonstrates the effectiveness of Algorithm 1.

Fig. 4. The convergence of Algorithm 1.

5 Conclusion

This paper presented a novel sparse softmax vector coding based deep cascade model (SSVD). One important advantage of this model is its use of the class-discriminative softmax vector. In addition, some subregions are more discriminative than the whole image, which allows misclassified images to be corrected through max pooling or average pooling. We also explored the effectiveness of the concatenated softmax vector. The experimental results show that the proposed method outperforms many existing sparse coding methods, especially on face databases.