1 Introduction

Identifying a person from a large number of face images is the principal task of face recognition [1]. To date, numerous methods have been proposed for solving the identification problem. Commonly used classification methods include k-nearest neighbor (k-NN) [2, 3], support vector machine (SVM) [4, 5], and sparse representation [6, 7]. Among these, representation-based methods have been an attractive research topic in the past decades owing to their strong performance in image classification tasks such as face recognition [8], gait recognition [9], action recognition [10], and image representation [11].

Adaptability is an important aspect in evaluating a recognition method for real-world applications. Conventional methods, such as SVM and neural networks, aim at training a specific classifier from a given dataset. Therefore, these classifiers are vulnerable to face variations. Recently, the sparse representation classification (SRC) [12] method has provided an effective way to raise the performance of face recognition under diverse conditions. SRC classifies a test sample by finding the minimum residual between the sample and its linear representation by the training samples of each class. The representation coefficients indicate the contribution of each class and are referred to as the sparse solution obtained via \(l_1\) regularization. However, \(l_1\) regularization usually requires an iterative numerical optimization, which leads to a heavy computational burden for real-time applications. Consequently, \(l_2\) regularization-based representation was proposed to overcome this drawback. The collaborative representation classification (CRC) [13] method is a typical \(l_2\) regularization-based representation method. CRC is computationally efficient owing to its closed-form solution. A recent study [14] shows that sparsity cannot be well guaranteed if naive \(l_2\) regularization is directly applied to the classification model. However, the study of CRC demonstrates that collaborative representation plays a more important role in classification than sparsity. In addition to CRC, robust regression for classification [15], two-phase test sample sparse representation [16], and discriminative sparse representation [17] are effective regularization-based classification methods. However, these methods still suffer from issues such as insufficient discriminative ability and heavy computational burden.

To seek an efficient discriminative representation for face images, in this paper we propose a novel discriminative projection and representation method for face representation and recognition. The proposed method achieves twofold discriminative properties through the special design of its objective function. It consists of two stages: face projection and face representation. In the face projection stage, our method produces a projection matrix by jointly minimizing the similar covariance and maximizing the dissimilar covariance, and this matrix maps the face images into a discriminative low-dimensional space which has the minimum similarity of samples. Moreover, we employ quadruplets [18] to construct both covariances, which integrate the similarity of the samples to improve the distinctiveness of the projection matrix. In the face representation stage, we produce the discriminative representation of the test sample by minimizing the sum of the representation results of all classes in the low-dimensional space, which makes the representation results of different classes as weakly correlated as possible. Moreover, the proposed method is computationally efficient because both stages have closed-form solutions. Experiments are conducted to compare the proposed method with other state-of-the-art methods.

The other parts of the paper are organized as follows. Section 2 introduces the related works. Section 3 describes the proposed method. Section 4 offers the experimental results, and we conclude this paper in Sect. 5.

2 Related Works

In this section, we briefly introduce the background of \(l_2\) regularization-based representation and linear discriminant analysis (LDA). Let \(X=\{x_i|x_i \in \mathbb {R}^m, i=1,\dots ,n \}\) be a training sample set with \(n\) training samples. If there are \(c\) classes and each class has \(s\) samples, \(X\) can be denoted as the matrix \(X=[X_1,\dots ,X_j,\dots ,X_c]=[x_1, \dots , x_{s(j-1)+1}, \dots , x_{sj}, \dots , x_n],j=1,\dots ,c\). Vector \(y\) denotes the test sample.

2.1 \(l_2\) Regularization-Based Representation

We take CRC as an example to describe the classification procedure of \(l_2\) regularization-based representation. CRC represents a test sample as a linear combination of the training samples of all classes, which can be written as the following equation [13]:

$$\begin{aligned} y=XQ,Q=[q_1, \dots , q_n], \end{aligned}$$
(1)

where \( q_i, i=1, \dots , n \), is the representation coefficient of the \(i\)-th training sample.

The regularized least-squares solution of Eq. (1) is \(Q^*=(X^T X+\mu I)^{-1}X^T y\), where \(\mu \) is a small positive constant and \(I\) is an identity matrix; \(Q^*\) is also referred to as the representation coefficients. Let \(Q_i^*\) be the coefficient vector of the \(i\)-th class. The regularized class-specific representation residual \(r_i=\frac{||y-X_iQ_i^*||_2}{||Q_i^*||_2}\) is used for classification, and the label of \(y\) is obtained by \(label(y)=\arg \min _i r_i\).
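As an illustration, the closed-form CRC solution and the residual-based decision rule above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the reference implementation of [13]: the function name `crc_classify`, the column-wise data layout, the default value of \(\mu\), and the small constant that guards against division by zero are all choices made for this example.

```python
import numpy as np

def crc_classify(X, labels, y, mu=1e-3):
    """Classify y by collaborative representation (sketch).

    X      : (m, n) array, one training sample per column (assumed layout)
    labels : length-n sequence with the class label of each column
    y      : (m,) test sample
    mu     : small positive ridge constant (illustrative default)
    """
    labels = np.asarray(labels)
    n = X.shape[1]
    # Q* = (X^T X + mu I)^{-1} X^T y, solved without forming the inverse
    Q = np.linalg.solve(X.T @ X + mu * np.eye(n), X.T @ y)
    classes = np.unique(labels)
    residuals = []
    for k in classes:
        mask = labels == k
        Qk = Q[mask]
        # regularized class-specific residual r_k = ||y - X_k Q_k|| / ||Q_k||
        r = np.linalg.norm(y - X[:, mask] @ Qk) / (np.linalg.norm(Qk) + 1e-12)
        residuals.append(r)
    return classes[int(np.argmin(residuals))], Q
```

The linear system is solved with `np.linalg.solve` instead of an explicit matrix inverse, which is usually the more stable choice for the same closed form.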

2.2 Linear Discriminant Analysis

LDA aims to produce a projection by jointly minimizing the within-class scatter and maximizing the between-class scatter, and then projects the data into a low-dimensional space. Therefore, LDA first constructs the between-class scatter matrix \({S_b}\) and the within-class scatter matrix \({S_w}\) as follows:

$$\begin{aligned} {S_b} = \sum \limits _{i = 1}^c {s({m_i} - \bar{m}){{({m_i} - \bar{m})}^T}}, \end{aligned}$$
(2)

and

$$\begin{aligned} {S_w} = \sum \limits _{i = 1}^c {\sum \limits _{j = 1}^s {(x_j^i - {m_i}){{(x_j^i - {m_i})}^T}} }, \end{aligned}$$
(3)

where \({m_i}\) and \(\bar{m}\) are the mean vectors of the samples of the i-th class and of all classes, respectively, and \(x_j^i\) is the j-th sample of the i-th class. The objective function of LDA is [18]

$$\begin{aligned} P = \arg \mathrm{{ max}}\frac{{tr({P^T}{S_b}P)}}{{tr({P^T}{S_w}P)}}, \end{aligned}$$
(4)

where P is the projection matrix, and \(tr( \cdot )\) denotes the trace of a matrix. Finally, we can obtain P by solving the following generalized eigenvalue problem:

$$\begin{aligned} {S_b}P = \lambda {S_w}P, \end{aligned}$$
(5)

where \(\lambda \) is an eigenvalue.
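For concreteness, Eqs. (2)–(5) can be sketched as below. The sketch makes several assumptions that are not part of the text: samples are stored as columns, the function name `lda_projection` is hypothetical, and the generalized eigenproblem is solved through the pseudo-inverse of \(S_w\) so that a singular \(S_w\) does not cause a failure.

```python
import numpy as np

def lda_projection(X, labels, d):
    """Return an (m, d) LDA projection matrix (sketch).

    X      : (m, n) array, one sample per column (assumed layout)
    labels : length-n sequence of class labels
    d      : number of projection directions to keep
    """
    labels = np.asarray(labels)
    m = X.shape[0]
    mean_all = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((m, m))
    S_w = np.zeros((m, m))
    for k in np.unique(labels):
        Xk = X[:, labels == k]
        mk = Xk.mean(axis=1, keepdims=True)
        # between-class scatter, weighted by the class size s (Eq. (2))
        S_b += Xk.shape[1] * (mk - mean_all) @ (mk - mean_all).T
        # within-class scatter (Eq. (3))
        S_w += (Xk - mk) @ (Xk - mk).T
    # S_b p = lambda S_w p, solved here as eig(pinv(S_w) S_b)
    evals, evecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    order = np.argsort(-evals.real)  # largest eigenvalues first
    return evecs[:, order[:d]].real
```

The columns with the largest eigenvalues are kept because Eq. (4) maximizes the ratio of between-class to within-class scatter.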

3 The Proposed Method

To simultaneously obtain distinctiveness and efficiency, our method represents a face image on a low-dimension space by integrating discriminative projection and \({l_2}\) regularization-based representation. Therefore, we exploit a projection to utilize the embedded discriminative information in the low-dimensional space under the regularization-based representation framework. The proposed objective function is represented as

$$\begin{aligned} \mathop {\min }\limits _{P,Q} \left\| {{P^T}y - {P^T}XQ} \right\| _2^2 + \alpha \sum \limits _{i = 1}^c {\left\| {{P^T}{X_i}{Q_i}} \right\| _2^2}, \end{aligned}$$
(6)

where \(\alpha \) is a balance factor, and \(Q = [{Q_1}, \ldots ,{Q_c}]\) is the representation coefficient. P is the projection matrix.

To solve P and Q, in this work, we divide the optimization of model (6) into two independent stages, including face projection and face representation:

  1. Face projection: To compute a projection matrix that transforms face images into low-dimensional features which have the minimum similar covariance and maximum dissimilar covariance. The projection matrix has a closed-form solution.

  2. Face representation: To produce a discriminative representation for each sample on the new low-dimensional space via \({l_2}\) regularization. This discriminative representation is obtained by a special design of the regularization term. Moreover, this stage also generates a closed-form solution.

3.1 Face Projection

Original face image set \(X = \{ {x_i}\left| {{x_i} \in {\mathbb {R}^m},i = 1,} \right. \cdots ,n\} \) lies in an m-dimension space. In order to find a representation for X in a d-dimension space, we take mapping \(f:{\mathbb {R}^m} \rightarrow {\mathbb {R}^d},(d < m)\) as projection function in pursuit of low-dimensional features. In this paper, instead of directly obtaining the projection via (6), we attempt to find the discriminative projection by solving a subspace problem which can obtain a closed-form solution. We use projection matrix P to denote the projection by f, which is obtained by jointly minimizing the similar covariance and maximizing the dissimilar covariance. To calculate these two covariances, we construct a similar set S of the pairs of samples from the same class and a dissimilar set D of the pairs of samples from different classes respectively. These two sets can be expressed as

$$\begin{aligned} S = \left\{ {\left( {x,x'} \right) \left| {D(x,x') < \delta } \right. } \right\} , \end{aligned}$$
(7)

and

$$\begin{aligned} D = \left\{ {\left( {x,x'} \right) \left| {D(x,x') > \delta } \right. } \right\} , \end{aligned}$$
(8)

where \(D(\cdot ,\cdot )\) is the distance of sample pair and \(\delta \) is a margin coefficient. Consequently, projection matrix P can be achieved via minimizing the expectation of the pairs from similar set and maximizing the expectation of the pair from dissimilar set, which maps the samples to a subspace that has the minimum difference of similar samples and maximum difference of dissimilar samples simultaneously. It is expressed as the following loss function:

$$\begin{aligned} L = \mu \mathrm{E}\left\{ {{{\left\| {{P^T}x - {P^T}x'} \right\| }^2}\left| S \right. } \right\} - \mathrm{E}\left\{ {{{\left\| {{P^T}x - {P^T}x'} \right\| }^2}\left| D \right. } \right\} , \end{aligned}$$
(9)

where \(\mu \) is a balance parameter and \((x, x')\) is a pair of samples.

To construct the similar and dissimilar sets, we extract quadruplets from the sample set. A quadruplet [18] consists of four samples \({x_i}\), \({x_j}\), \({x_k}\), and \({x_l}\) from the sample set that satisfy

$$\begin{aligned} D({x_i},{x_j}) + \delta < D({x_k},{x_l}). \end{aligned}$$
(10)

Thus, for a sample x, within-class similar sample pair \(({x_i},{x_j})\) is employed to produce similar set S and between-class dissimilar sample pair \(({x_k},{x_l})\) is employed to construct the dissimilar set D respectively. As a result, Eq. (9) can be rewritten as

$$\begin{aligned} L = \mu \mathrm{E}\left\{ {{{\left\| {{P^T}{x_i} - {P^T}{x_j}} \right\| }^2}\left| S \right. } \right\} - \mathrm{E}\left\{ {{{\left\| {{P^T}{x_k} - {P^T}{x_l}} \right\| }^2}\left| D \right. } \right\} . \end{aligned}$$
(11)

It is observed that

$$\begin{aligned} \mathrm{E}\left\{ {{{\left\| {{P^T}{x_i} - {P^T}{x_j}} \right\| }^2}\left| S \right. } \right\} = \mathrm{{tr}}\left\{ {P{\Sigma _S}{P^T}} \right\} , \end{aligned}$$
(12)

and

$$\begin{aligned} \mathrm{E}\left\{ {{{\left\| {{P^T}{x_k} - {P^T}{x_l}} \right\| }^2}\left| D \right. } \right\} = \mathrm{{tr}}\left\{ {P{\Sigma _D}{P^T}} \right\} , \end{aligned}$$
(13)

where \({\Sigma _S} = \mathrm{E}\{ ( {x_i} - {x_j} ){( {x_i} - {x_j} )^T} | S \}\) and \({\Sigma _D} = \mathrm{E}\{ ( {x_k} - {x_l} ){( {x_k} - {x_l} )^T} | D \}\) are the covariance matrices of similar pairs and dissimilar pairs respectively. This leads Eq. (11) to

$$\begin{aligned} L = \mu \mathrm{{tr}}\left\{ {P{\Sigma _S}{P^T}} \right\} - \mathrm{{tr}}\left\{ {P{\Sigma _D}{P^T}} \right\} . \end{aligned}$$
(14)

Finally, we have

$$\begin{aligned} L \propto \mathrm{{tr}}\left\{ {P{\Sigma _S}\Sigma _D^{ - 1}{P^T}} \right\} = \mathrm{{tr}}\left\{ {P{\Sigma _R}{P^T}} \right\} , \end{aligned}$$
(15)

where \({\Sigma _R} = {\Sigma _S}\Sigma _D^{ - 1}\) is the ratio matrix between the similar and dissimilar covariance matrices. \({\Sigma _R}\) is a semidefinite matrix and can be decomposed by singular value decomposition (SVD) to obtain an orthogonal matrix. Thus, orthogonal matrix P maps the samples into the space spanned by the eigenvectors associated with the d smallest eigenvalues, which has the minimum similarity of samples.
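The projection stage can be sketched as follows. Several simplifications here are assumptions rather than the procedure of the paper: all same-class pairs stand in for the similar set S and all different-class pairs for the dissimilar set D (no quadruplet mining with margin \(\delta\)), a small ridge `eps` is added so that \(\Sigma_D\) can be inverted, and the function names are hypothetical. Following the text, the directions are taken from the SVD of \(\Sigma_R\).

```python
import numpy as np
from itertools import combinations

def pair_covariances(X, labels):
    """Estimate Sigma_S and Sigma_D from sample pairs (simplified sketch).

    X      : (m, n) array, one sample per column (assumed layout)
    labels : length-n sequence of class labels
    """
    labels = np.asarray(labels)
    m, n = X.shape
    sig_s = np.zeros((m, m))
    sig_d = np.zeros((m, m))
    n_s = n_d = 0
    for i, j in combinations(range(n), 2):
        diff = (X[:, i] - X[:, j])[:, None]
        if labels[i] == labels[j]:   # similar pair (same class)
            sig_s += diff @ diff.T
            n_s += 1
        else:                        # dissimilar pair (different classes)
            sig_d += diff @ diff.T
            n_d += 1
    return sig_s / n_s, sig_d / n_d

def discriminative_projection(X, labels, d, eps=1e-6):
    """Return an (m, d) matrix P built from Sigma_R = Sigma_S Sigma_D^{-1}."""
    sig_s, sig_d = pair_covariances(X, labels)
    m = X.shape[0]
    # eps * I is an assumed regularizer that keeps Sigma_D invertible
    sig_r = sig_s @ np.linalg.inv(sig_d + eps * np.eye(m))
    # np.linalg.svd returns singular values in descending order,
    # so the last d columns of U belong to the d smallest ones
    U, _, _ = np.linalg.svd(sig_r)
    return U[:, -d:]
```

Samples are then mapped to the low-dimensional space by `P.T @ X`, matching the convention \(P^T x\) used in Eq. (9).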

3.2 Face Representation

On this stage, we can represent face images in the low-dimensional space. As a result, given projection matrix P, training sample X and test sample y can be projected onto a low-dimensional space via \(F = {P^T}X\) and \(v = {P^T}y\). The objective function in Eq. (6) is rewritten as the following formula:

$$\begin{aligned} \mathop {\min }\limits _Q \left\| {v - FQ} \right\| _2^2 + \alpha \sum \limits _{i = 1}^c {\left\| {{F_i}{Q_i}} \right\| _2^2}, \end{aligned}$$
(16)

Because of the convexity and differentiability of Eq. (16), we can obtain the optimal solution by taking the derivative with respect to Q and setting it to 0. The computational procedure is presented as follows:

Let \(f(Q) = \left\| {v - FQ} \right\| _2^2 + \alpha \sum \limits _{i = 1}^c {\left\| {{F_i}{Q_i}} \right\| _2^2} \). The derivative with respect to Q of the first term of f(Q) is

$$\begin{aligned} \frac{d}{{dQ}}{\left\| {v - FQ} \right\| ^2} = - 2{F^T}(v - FQ). \end{aligned}$$
(17)

Then we need to determine the derivative of the second term \(\frac{d}{{dQ}}\left( {\alpha \sum \limits _{i = 1}^c {\left\| {{F_i}{Q_i}} \right\| _2^2} } \right) \). Because \(g(Q) = \alpha \sum \limits _{i = 1}^c {\left\| {{F_i}{Q_i}} \right\| _2^2}\) depends on Q only through its class-wise blocks \({Q_k}\), we compute the partial derivatives \(\frac{{\partial g}}{{\partial {Q_k}}}\) \((k = 1, \ldots ,c)\) and stack them to obtain \(\frac{{dg}}{{dQ}}\).

g(Q) is composed of c terms that depend on the blocks \({Q_k}\). We first calculate the c partial derivatives \(\frac{{\partial g}}{{\partial {Q_k}}}\) as follows:

$$\begin{aligned} \begin{array}{l} \frac{{\partial g}}{{\partial {Q_k}}} = \frac{\partial }{{\partial {Q_k}}}\left( {\alpha \sum \limits _{i = 1}^c {\left\| {{F_i}{Q_i}} \right\| _2^2} } \right) = \alpha \sum \limits _{i = 1}^c {\frac{\partial }{{\partial {Q_k}}}\left\| {{F_k}{Q_k}} \right\| _2^2} \\ \mathrm{{ }} = \alpha c\frac{\partial }{{\partial {Q_k}}}\left\| {{F_k}{Q_k}} \right\| _2^2\mathrm{{ = }}2\alpha cF_k^T\left( {{F_k}{Q_k}} \right) . \end{array} \end{aligned}$$
(18)

Thus the derivative \(\frac{{dg}}{{dQ}}\) is

$$\begin{aligned} \frac{{dg}}{{dQ}} = \left( {\begin{array}{*{20}{c}} {\frac{{\partial g}}{{\partial {Q_1}}}}\\ \vdots \\ {\frac{{\partial g}}{{\partial {Q_c}}}} \end{array}} \right) = \left( {\begin{array}{*{20}{c}} {2\alpha cF_1^T\left( {{F_1}{Q_1}} \right) }\\ \vdots \\ {2\alpha cF_c^T\left( {{F_c}{Q_c}} \right) } \end{array}} \right) = 2\alpha c\left( {\begin{array}{*{20}{c}} {F_1^T{F_1}}&{} \cdots &{}O\\ \vdots &{} \ddots &{} \vdots \\ O&{} \cdots &{}{F_c^T{F_c}} \end{array}} \right) Q. \end{aligned}$$
(19)

Let \(M = \left( {\begin{array}{*{20}{c}} {F_1^T{F_1}}&{} \cdots &{}O\\ \vdots &{} \ddots &{} \vdots \\ O&{} \cdots &{}{F_c^T{F_c}} \end{array}} \right) \), the derivative of the second term is \(\frac{{dg}}{{dQ}} = 2\alpha cMQ\).

Combining the derivatives of the first and second terms, the derivative of Eq. (16) is \(\frac{{df}}{{dQ}} = - 2{F^T}(v - FQ) + 2\alpha cMQ\). Thus, setting the derivative to zero, we have

$$\begin{aligned} 2{F^T}FQ - 2{F^T}v + 2\alpha cMQ = 0, \end{aligned}$$
(20)

which leads to

$$\begin{aligned} Q = {({F^T}F + \alpha cM)^{ - 1}}{F^T}v. \end{aligned}$$
(21)

As a result, the optimal Q has a closed-form solution.
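Eq. (21) can be evaluated directly, as in the following sketch. It implements the formula as printed, including the factor \(\alpha c\); the function name `represent`, the column-wise layout of F, and the use of a label mask to build the block-diagonal matrix M are assumptions of this example.

```python
import numpy as np

def represent(F, labels, v, alpha):
    """Closed-form representation coefficients of Eq. (21) (sketch).

    F      : (d, n) projected training samples, one per column (assumed layout)
    labels : length-n sequence of class labels
    v      : (d,) projected test sample
    alpha  : balance factor
    """
    labels = np.asarray(labels)
    c = np.unique(labels).size
    G = F.T @ F
    # M keeps only the same-class entries of F^T F; when the columns are
    # grouped by class this is blockdiag(F_1^T F_1, ..., F_c^T F_c)
    M = G * (labels[:, None] == labels[None, :])
    # Q = (F^T F + alpha c M)^{-1} F^T v
    return np.linalg.solve(G + alpha * c * M, F.T @ v)
```

The system matrix is invertible when the projected samples within each class are linearly independent, which makes M positive definite; this condition is assumed here rather than checked.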

3.3 Classification

The proposed method classifies the samples in a new low-dimensional space. Thus, training sample X and test sample y should be projected as F and v by projection matrix P. Finally, we use Eq. (21) to compute representation coefficient Q of the projected test sample v. Then, a test sample v is classified to the k-th class according to the following procedure,

$$\begin{aligned} k = \mathop {\arg \mathrm{{ min}}}\limits _i \left\| {v - {F_i}{Q_i}} \right\| _2^2. \end{aligned}$$
(22)

In summary, we describe a full classification algorithm in Algorithm 1.

Algorithm 1. The full classification algorithm of the proposed method.
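The steps described in Sects. 3.2 and 3.3 (project, represent, classify by Eq. (22)) can be sketched end to end as below. This is an illustrative sketch and not a transcription of Algorithm 1: the function name `classify`, the column-wise data layout, and the assumption that a projection matrix P is already available are all choices made for this example.

```python
import numpy as np

def classify(P, X, labels, y, alpha):
    """Classify test sample y with a given projection P (sketch).

    P      : (m, d) projection matrix, assumed to be precomputed
    X      : (m, n) training samples, one per column (assumed layout)
    labels : length-n sequence of class labels
    y      : (m,) test sample
    alpha  : balance factor
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # project training and test samples: F = P^T X, v = P^T y
    F, v = P.T @ X, P.T @ y
    # representation coefficients from Eq. (21)
    G = F.T @ F
    M = G * (labels[:, None] == labels[None, :])
    Q = np.linalg.solve(G + alpha * classes.size * M, F.T @ v)
    # class-wise squared residuals ||v - F_i Q_i||^2 from Eq. (22)
    residuals = [np.sum((v - F[:, labels == k] @ Q[labels == k]) ** 2)
                 for k in classes]
    return classes[int(np.argmin(residuals))]
```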

3.4 Computational Complexity Analysis

The major computational cost of the proposed method lies in matrix operations. For an \(n \times n\) input matrix, the computation consists of two parts: computing projection matrix P and optimizing representation coefficient Q. Obtaining the projection matrix costs about \(O({n^3})\). To obtain representation coefficient Q, we need to evaluate Eq. (21) for each test sample, whose dimension is d. The total computational complexity for k test samples is about \(O(d{n^2} + kdn)\). As a result, the total computational complexity of the proposed method is about \(O({n^3} + d{n^2} + kdn)\). Note that the projection matrix and the matrix \({({F^T}F + \alpha cM)^{ - 1}}{F^T}\) in Eq. (21) need to be computed only once and can be reused for all test samples. Although the proposed method needs two computational steps, it is more efficient than iterative methods.
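The reuse across test samples can be made explicit by precomputing the operator \((F^T F + \alpha cM)^{-1}F^T\) once and applying it to a batch of projected test samples. The sketch below assumes column-wise data and uses hypothetical function names; it is one way to organize the computation, not necessarily the one used in the experiments.

```python
import numpy as np

def fit_representation_operator(F, labels, alpha):
    """Precompute T = (F^T F + alpha c M)^{-1} F^T once (sketch).

    F      : (d, n) projected training samples, one per column (assumed layout)
    labels : length-n sequence of class labels
    alpha  : balance factor
    """
    labels = np.asarray(labels)
    c = np.unique(labels).size
    G = F.T @ F
    M = G * (labels[:, None] == labels[None, :])
    # solving against F^T yields the (n, d) operator without an explicit inverse
    return np.linalg.solve(G + alpha * c * M, F.T)

def represent_batch(T, V):
    """Apply the precomputed operator to k test samples.

    V : (d, k) projected test samples, one per column
    Returns an (n, k) array whose columns are the coefficient vectors Q.
    """
    return T @ V
```

After the one-time fit, each additional test sample costs a single matrix-vector product with the \(n \times d\) operator.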

4 Experimental Results

In this section, we conduct experiments on face datasets to demonstrate the effectiveness of the proposed method. The datasets used are FERET [19], Extended Yale B [20], and CMU Multi-PIE [27]. Several state-of-the-art recognition methods, including CRC [13], L1LS [21], FISTA [24], Homotopy [22], the dual augmented Lagrangian method (DALM) [23], INNC [25], and a fusion classification method (FCM) [26], are used for experimental comparison. Linear discriminant analysis [18] combined with the nearest neighbor classifier is also included in our experiments.

4.1 Dimension and Parameter Selections

The feature dimension is one of the important factors that affect the classification accuracy. Usually, the classification accuracy varies greatly under different dimensions. However, we find that the performance of the proposed method is not greatly affected by the variation of the dimension. Figure 1 shows the results of the classification accuracy under different dimensions on the FERET dataset. It is obvious that the proposed method achieves nearly stable classification accuracies for various dimensions under different numbers of training samples per class. In our experiments, the range of feature dimension is [200, 1000].

There is only one parameter \(\alpha \) in the proposed algorithm. It balances the effect of the two terms in the objective function. We choose the optimal value of \(\alpha \) for each dataset among five candidate values: 0.01, 0.1, 1, 10, and 100. The search procedure can quickly find the value that leads to the best classification accuracy. The proposed method maintains a stable classification performance when the value of \(\alpha \) varies. Figure 2 presents the relationship between \(\alpha \) and the classification accuracy on the FERET dataset. It is seen that almost stable classification rates are obtained when the value of \(\alpha \) varies in a proper range for a given dataset.
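The candidate search described above amounts to a small grid search. In the sketch below, `evaluate` is a hypothetical callable, not defined in the paper, that returns the classification accuracy obtained with a given \(\alpha\) (for example on held-out samples); the function name `select_alpha` is likewise an assumption.

```python
def select_alpha(evaluate, candidates=(0.01, 0.1, 1, 10, 100)):
    """Pick the candidate alpha with the highest accuracy (sketch).

    evaluate   : hypothetical callable mapping alpha -> classification accuracy
    candidates : candidate values listed in the text
    """
    scores = {alpha: evaluate(alpha) for alpha in candidates}
    best = max(scores, key=scores.get)  # first maximum wins on ties
    return best, scores
```

Returning the full score table alongside the best value makes it easy to inspect how stable the accuracy is across candidates.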

Fig. 1. Classification accuracy versus dimension on the FERET dataset.

Fig. 2. Classification accuracy versus parameter on the FERET dataset.

4.2 Experiments on the FERET Face Dataset

This experiment is conducted on the FERET face dataset [19], which contains 1400 face images from 200 subjects. Figure 3 shows some face images from this dataset. Every face image was resized to \(40 \times 40\) pixels. The first 3, 4, and 5 face images of each subject were used as the training samples, and the remaining face images were used as the test samples. Parameter \(\alpha \) was set to 10 and the dimension to 1000. The experimental results are presented in Table 1. The proposed method achieves a better recognition rate than the other classification methods, which implies that it can capture more discriminative information for feature representation.

Fig. 3. Some face images from the FERET face dataset. The face images shown in the first and second rows are from three different subjects.

Table 1. Classification accuracies of different methods on the FERET dataset.

4.3 Experiments on the Extended Yale B Face Dataset

The Extended Yale B [20] face dataset contains 2414 single frontal facial images of 38 individuals. These images were captured under various controlled lighting conditions. The original size of an image was \(192 \times 168\) pixels. In our experiments, all images were cropped and resized to \(84 \times 96\) pixels. Figure 4 shows some face images from the Extended Yale B face dataset. The first 10, 20, 30, and 40 face images of each subject were treated as the original training samples, and the remaining face images were used as the test samples. Parameter \(\alpha \) was set to 0.001 and the dimension to 1000. The experimental results are presented in Table 2. Compared with the other methods, the proposed method improves the recognition rate by nearly 10% under the different conditions. More importantly, these superior recognition rates are obtained at a lower dimension (1000), whereas the other methods operate at the original dimension (8064). This means that the proposed method can capture more discriminative low-dimensional features under different conditions.

Fig. 4. Some face images from the Extended Yale B face dataset. The face images shown in the first, second, and third rows are from three different subjects.

Table 2. Classification accuracies of different methods on the Extended Yale B dataset.

Fig. 5. Some face images from the CMU Multi-PIE face dataset. The face images shown in the first, second, and third rows are from three different subjects.

Table 3. Classification accuracies of different methods on the CMU Multi-PIE dataset.

4.4 Experiments on the CMU Multi-PIE Face Dataset

In this subsection, we evaluate the performance of our method on the CMU Multi-PIE face dataset [27]. The CMU Multi-PIE face dataset is composed of face images of 337 persons with variations in pose, expression, and illumination. Figure 5 shows some face images from this dataset. We select a subset composed of 249 persons under 20 different illumination conditions with a frontal pose and 7 different illumination conditions with a smile expression. All images are cropped and resized to \(40 \times 30\) pixels. We choose the face images corresponding to the first 3, 5, 7, and 9 illuminations from the 20 illuminations, together with one of the 7 smiling images, as training samples and use the remaining images as test samples. Parameter \(\alpha \) is set to 0.00001 and the dimension to 1000. Table 3 lists the classification accuracy on the four test sets obtained using different methods. From the results, it can be seen that our method obtains better classification accuracy than the other methods. In other words, our method is more robust to variations in illumination, pose, and expression.

5 Conclusion

To seek an efficient discriminative representation for face images, we proposed a novel discriminative projection and representation method for face recognition. The method achieves effective and efficient recognition by using a specific regularization term and a projection matrix in the objective function. The projection, produced by minimizing the similar covariance and maximizing the dissimilar covariance, yields features which have the minimum similarity of samples. The discriminative representation is obtained by minimizing the correlation of samples. Therefore, the proposed method possesses twofold discriminative properties, which help to improve the classification accuracy. In addition, the proposed method provides a computationally efficient algorithm for face recognition tasks.