1 Introduction

Automatic face recognition [1] remains one of the most visible and challenging research topics in computer vision, machine learning and biometrics. It is widely applied in fields such as biometric authentication, security, and human-computer interaction. Compared with other biometrics, such as iris and palm identification, face recognition has the advantages of being convenient, immediate and well accepted.

In face recognition, face representation plays a vital role, and its effectiveness is closely tied to the final recognition results. Face representation is, in essence, feature extraction from face images. Which low-dimensional features are the most effective or informative for discrimination is a crucial question in face recognition, and many pioneering approaches have been proposed to answer it. Conventional facial features can be roughly divided into holistic features (PCA [2], LDA [3], LPP [4], etc.) and local features (LBP [5], SIFT [6], etc.). With so many feature extraction methods available, practitioners facing a specific problem are often unsure which features to use. Recent progress in compressed sensing and sparse representation, however, has led to novel algorithms for face recognition. Wright et al. [7] put forward a seemingly simple yet effective method called sparse representation-based classification (SRC): the training samples form a structured dictionary, the test sample is decomposed over this dictionary via l 1-norm minimization to obtain its coding coefficients, and the test image is then assigned to the class that produces the minimum reconstruction error. In SRC, the precise choice of feature space is no longer critical, and the method is robust to occlusion. However, SRC builds the dictionary from all the training images, so the resulting dictionary can be very large, which burdens the subsequent sparse solver. To overcome this drawback, Yang et al. [8] proposed an unsupervised dictionary learning algorithm that obtains dictionary elements for each class. Yang et al. [9] presented a dictionary learning method that introduces the Fisher criterion into the objective function to improve classification performance. Li et al. [10] came up with a local sparse representation based classification (LSRC) scheme, which performs sparse decomposition in a local manner: kNN rules are used to find the k nearest neighbors of each test sample, and the selected samples form the over-complete dictionary.

For general pattern classification problems such as dimensionality reduction, classification, and clustering, the locality structure of data has been observed to be critical [11, 12]. Lu et al. [13] took data locality into consideration and imposed locality on the l 1 regularization. They used the distance between the test sample and each training sample to characterize their similarity and thereby formed the weight matrix. By solving a weighted l 1-minimization problem, they achieved impressive results on the Extended Yale B and AR databases and on several datasets from the UCI repository. Nevertheless, Wang et al. [14] argued that similarity is not merely a matter of distance, and showed that traditional distance-based similarity measures may lead to high classification error rates even on several simple datasets. In addition, research on the local binary pattern (LBP) shows that LBP-coded features are highly discriminative [15], which makes them suitable for image classification tasks. Inspired by these findings, we use the similarity of LBP features between images to form the weight matrix; this better exploits data locality and thus boosts face recognition accuracy.

The rest of this paper is organized as follows: LBP and sparse representation-based classification are reviewed in Sects. 2 and 3, respectively. Section 4 presents the proposed method. Section 5 reports extensive experiments on publicly available databases that verify the effectiveness of the proposed method. Finally, conclusions are drawn in Sect. 6.

2 Local Binary Pattern

The LBP operator was first introduced by Ojala et al. [16] as a texture descriptor. Ahonen et al. [5] later applied it to face recognition and obtained outstanding results, demonstrating that LBP describes face images well.

The original LBP operator is defined on a 3 × 3 window. The value of the center pixel is used as a threshold: each of the 8 surrounding pixels is assigned a binary value of 1 if its value is higher than or equal to the threshold, and 0 otherwise. The 8 resulting bits are then read sequentially in the clockwise direction, and the 8-bit binary number (or its decimal equivalent) is assigned to the center pixel as a label describing the local texture. The basic LBP operator is illustrated in Fig. 1.

Fig. 1. The original LBP operator.
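To make the thresholding step concrete, the following minimal sketch computes the basic 3 × 3 LBP code for each interior pixel of a grayscale image stored as a 2D NumPy array; it is an illustration under these assumptions, not an optimized implementation, and border pixels are simply skipped.

```python
# Minimal sketch of the basic 3x3 LBP operator on a grayscale image.
import numpy as np

def basic_lbp(image):
    """Return the LBP-labeled image for the interior pixels."""
    # Clockwise offsets of the 8 neighbors, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = image.shape
    labels = np.zeros((h, w), dtype=np.uint8)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = image[y, x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # Neighbors >= center contribute a 1 to the binary code.
                if image[y + dy, x + dx] >= center:
                    code |= 1 << (7 - bit)
            labels[y, x] = code
    return labels
```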

In order to facilitate the analysis of textures at different scales, the basic LBP operator is extended to circular neighborhoods with different radii. In this case, P points on the edge of a circle of radius R are sampled and compared with the value of the center pixel. For ease of presentation, the notation (P, R) denotes P sampling points on a circle of radius R. See Fig. 2 for examples of circular neighborhoods.

Fig. 2. The circular (8, 1), (16, 2) and (8, 2) neighborhoods.

However, not all patterns coded by LBP are useful for describing texture, so it is necessary to identify the local patterns that account for the major part of all patterns. These patterns are referred to as uniform patterns. In their experiments, Ojala et al. [15] found that uniform patterns account for about 90 percent of the 3 × 3 patterns in the surface textures they examined. A local binary pattern is called uniform if, when the bit pattern is considered circular, it contains at most two bitwise transitions from 0 to 1 or vice versa [17]. For example, the patterns 11111111 (0 transitions), 00111000 (2 transitions) and 11100111 (2 transitions) are uniform, whereas the pattern 00110110 (4 transitions) is not. Experimental results have demonstrated that uniform patterns describe most of the texture information while retaining strong discriminative power for classification tasks.
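As a small illustration of this uniformity test, the helper below counts the circular 0/1 transitions of an 8-bit pattern; the function name and bit ordering are our own illustrative choices.

```python
# A pattern is uniform if its circular bit string has at most two 0/1 transitions.
def is_uniform(pattern, bits=8):
    """Return True if the circular binary pattern has at most 2 transitions."""
    transitions = 0
    for i in range(bits):
        # Compare each bit with the next one, wrapping around circularly.
        current = (pattern >> i) & 1
        nxt = (pattern >> ((i + 1) % bits)) & 1
        if current != nxt:
            transitions += 1
    return transitions <= 2

# Examples from the text: 11111111 and 00111000 are uniform, 00110110 is not.
assert is_uniform(0b11111111) and is_uniform(0b00111000)
assert not is_uniform(0b00110110)
```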

A histogram of the labeled image \( f_{l} \left( {x,y} \right) \) can be defined as

$$ H_{k} = \sum\limits_{x,y} {I\left\{ {f_{l} (x,y) = k} \right\}} ,k = 0, \ldots ,n - 1 $$
(1)

in which n is the number of different labels produced by the LBP operator and

$$ I\left\{ A \right\} = \begin{cases} 1, & A\;{\text{is true}} \\ 0, & A\;{\text{is false}} \end{cases} $$

This histogram contains information about the distribution of local micro-patterns, such as spots, flat areas, edge ends, and curves.

Generally, when extracting features from face images, we can divide the face image into small blocks, extract features from each block independently, and then concatenate the block descriptors to form a global description of the face image. In this way we obtain a description of the face image at both local and holistic levels. Several similarity measures have been proposed for histograms, e.g. histogram intersection, the log-likelihood statistic, and the χ2 statistic [5].

Experimental results have shown that the χ2 statistic performs better than histogram intersection and the log-likelihood statistic when uniform patterns are used [5]. In this paper, since we use uniform LBP patterns in the (8, 1) neighborhood, we choose the χ2 statistic to measure the similarity of histograms; a code sketch of this comparison is given below.
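The following sketch illustrates block-wise uniform-LBP histogram extraction and the χ2 comparison. It assumes scikit-image's local_binary_pattern with non-rotation-invariant uniform (8, 1) patterns; the block grid, normalization and the small smoothing constant are illustrative choices.

```python
# Block-wise uniform-LBP histograms and the chi-square distance between them.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(image, rows=8, cols=8, P=8, R=1):
    """Concatenate uniform-LBP histograms of all blocks of the image."""
    lbp = local_binary_pattern(image, P, R, method="nri_uniform")
    n_bins = P * (P - 1) + 3  # 59 labels for P = 8 (uniform + one non-uniform bin)
    h, w = lbp.shape
    hists = []
    for i in range(rows):
        for j in range(cols):
            block = lbp[i * h // rows:(i + 1) * h // rows,
                        j * w // cols:(j + 1) * w // cols]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            hists.append(hist / max(hist.sum(), 1))  # normalize each block
    return np.concatenate(hists)

def chi_square(S, M, eps=1e-10):
    """Chi-square statistic between two histograms S and M."""
    return np.sum((S - M) ** 2 / (S + M + eps))
```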

3 Sparse Representation-based Classification and Weighted Sparse Representation

3.1 Sparse Representation-based Classification (SRC)

Theoretical results show that well-aligned images of a convex, Lambertian object lie near a low-dimensional linear subspace of the high-dimensional image space [18]. This is the only prior knowledge about the training samples required by SRC. The idea of SRC is as follows [7].

Suppose there are C distinct classes and sufficient training samples are available for each class. Let the face images be of size w × h, and let \( n_{i} \) denote the number of samples of the i-th class. We stack the \( n_{i} \) training images of the i-th class as columns of a matrix \( A_{i} = \left[ {\nu_{i,1} , \ldots ,\nu_{{i,n_{i} }} } \right] \in R^{{m \times n_{i} }} \) (m = w × h). According to linear subspace theory, a test sample \( y \in R^{m} \) belonging to this class can be approximated by a linear combination of the samples in \( A_{i} \), i.e.

$$ y \approx \alpha_{i,1} v_{i,1} + \alpha_{i,2} v_{i,2} + \ldots + \alpha_{{i,n_{i} }} v_{{i,n_{i} }} $$
(2)
$$ \alpha_{i,j} \in R,\quad j = 1,2, \ldots ,n_{i} . $$

Since the identity of the test sample y is initially unknown, let A be the concatenation of the n training samples from all C classes, where \( \sum\nolimits_{i = 1}^{C} {n_{i} = n} \):

$$ \begin{aligned} A &= \left[ {A_{1} ,A_{2} , \cdots ,A_{C} } \right] \hfill \\ &= \left[ {v_{1,1} , \ldots ,v_{{1,n_{1} }} , \ldots ,v_{i,1} ,v_{i,2} , \ldots ,v_{{i,n_{i} }} , \ldots ,v_{C,1} , \ldots ,v_{{C,n_{C} }} } \right] \hfill \\ \end{aligned} $$
(3)

If we use the full matrix A to represent the test image y, we can write

$$ y = Ax_{0} \,\in\,R^{m} $$
(4)

where \( x_{0} = [0, \ldots ,0, \ldots ,\alpha_{i,1} ,\alpha_{i,2} , \ldots ,\alpha_{{i,n_{i} }} , \ldots ,0, \ldots ,0]^{T} \in R^{n} \) is a coefficient vector whose entries are zero except those associated with the i-th class.

In robust face recognition, the system y = Ax is typically under-determined, so it has infinitely many solutions, from which we need to pick one. Conventionally, this is done by choosing the minimum l 2-norm solution. However, that solution is not sparse and carries little discriminative information. This motivates seeking the sparsest solution to y = Ax by solving the following optimization problem:

$$ (l^{0} )\quad \hat{x}_{0} = \arg \hbox{min}\,||x||_{0} ,\;subject\;to\;Ax = y $$
(5)

where \( \left\| . \right\|_{0} \) denotes the l 0-norm, which counts the number of nonzero elements in a vector.

Unfortunately, finding the sparsest solution of an under-determined system of linear equations is NP-hard. In practice, greedy pursuit algorithms are used to find a suboptimal yet sufficiently sparse solution, e.g. matching pursuit [19], orthogonal matching pursuit [20], and stagewise orthogonal matching pursuit [21].

Recent progress in the theory of sparse representation and compressed sensing reveals that if the solution x 0 is sparse enough, the solution of the l 0-minimization problem (5) coincides with that of the following l 1-minimization problem [22]:

$$ (l^{1} )\quad \hat{x}_{1} = \arg \hbox{min}\,||x||_{1} ,\;subject\;to\;Ax = y $$
(6)

To solve the l 1-minimization problem, one can use gradient projection methods [23], the homotopy algorithm [24], iterative shrinkage-thresholding [25], etc.

When dealing with small, dense errors, the l 1-minimization problem (6) can be relaxed to the following stable l 1-minimization problem:

$$ (l_{s}^{1} )\quad \hat{x}_{1} = \arg \hbox{min}\,||x||_{1} ,\;subject\;to\;||Ax - y||_{2} \le \varepsilon $$
(7)

where ε > 0 is the error tolerance.

Once the coefficient vector \( \hat{x}_{1} \) is obtained via (6) or (7), the test sample y is assigned to the class that minimizes the residual between y and its class-wise reconstruction \( \hat{y}_{i} \):

$$ \mathop {\hbox{min} }\limits_{i} \;r_{i} (y) = ||y - A\delta_{i} (\hat{x}_{1} )||_{2} $$
(8)

where \( \delta_{i} \left( x \right) \) is an operator that keeps only the coefficients in x associated with class i, and \( \hat{y}_{i} = A\delta_{i} (\hat{x}_{1} ) \) is the approximation of the test sample y produced by class i. A code sketch of this classification rule is given below.
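The following sketch illustrates the SRC decision rule. It approximates the constrained problems above with the Lagrangian (Lasso) form solved by scikit-learn, which is a common practical substitute rather than the solver used in [7]; the regularization weight is an illustrative value.

```python
# Illustrative SRC classifier. The stable problem (7) is approximated by the
# Lasso form  min ||Ax - y||_2^2 + alpha * ||x||_1.  labels[j] is the class
# of the j-th column of A; columns of A are assumed l2-normalized.
import numpy as np
from sklearn.linear_model import Lasso

def src_classify(A, labels, y, alpha=0.01):
    """Return the predicted class of the test sample y given dictionary A."""
    solver = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    solver.fit(A, y)
    x_hat = solver.coef_
    residuals = {}
    for c in np.unique(labels):
        # delta_c(x): keep only the coefficients belonging to class c.
        x_c = np.where(labels == c, x_hat, 0.0)
        residuals[c] = np.linalg.norm(y - A @ x_c)
    # Assign y to the class with the smallest reconstruction residual.
    return min(residuals, key=residuals.get)
```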

With sufficient training samples, SRC achieves excellent results. However, when training samples are scarce, SRC may perform worse than conventional classifiers, which makes it unstable. To overcome this drawback, Lu et al. [13] proposed a more robust weighted sparse representation method that integrates both sparsity and the locality structure of the data into a unified framework. The WSRC algorithm is described in Sect. 3.2.

3.2 Weighted Sparse Representation-based Classification (WSRC)

SRC ensures sparsity, but it may lose locality information. Local sparse coding has been shown to be effective for image classification, and weighted sparse representation-based classification (WSRC) is one such method: it preserves the similarity between the test sample and its neighboring training data while seeking the sparse linear representation [13]. WSRC solves the following weighted l 1-minimization problem:

$$ (Weighted\;l^{1} )\quad \hat{x}_{1} = \arg \hbox{min}\,||Wx||_{1} ,\;subject\;to\;y = Ax $$
(9)

As above, A is the matrix containing all training samples (one sample per column), y is the test sample, x is the coding coefficient vector, and W is a block-diagonal matrix acting as a locality adaptor that penalizes the distance between y and each training sample. In [13], W is defined as

$$ diag\,(W) = [dist\,(y,x_{1}^{1} ), \ldots ,dist\,(y,x_{{n_{C} }}^{C} )]^{T} $$

where \( dist\,(y,\,x_{i}^{c} ) = \,\left\| {y - x_{i}^{c} } \right\|^{s} \) denotes the distance between y and the i-th sample of the c-th class, and s is the locality adaptor parameter.

When coping with occlusion, the weighted l 1-minimization problem (9) can be extended to the following stable version:

$$ (Weighted\;l_{s}^{1} )\quad \hat{x}_{1} = \arg \hbox{min}\,||Wx||_{1} ,\;subject\;to\;||y - Ax||_{2} \le \varepsilon $$
(10)

Once the coding coefficient vector \( \hat{x} \) is obtained, the subsequent classification procedure is the same as in SRC. A sketch of the weighted coding step is given below.
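To show how the weight matrix enters the optimization, the sketch below builds the distance-based weights of WSRC and reduces the weighted problem to a standard l 1 problem through the change of variable z = Wx; the Lasso approximation and the parameter values are our own assumptions rather than the exact solver of [13].

```python
# Illustrative WSRC coding step. With a diagonal, invertible W and the
# substitution z = W x, min ||W x||_1 s.t. y = A x becomes
# min ||z||_1 s.t. y = (A W^{-1}) z, a standard l1 problem over a rescaled
# dictionary; the Lasso form is used here as an approximation.
import numpy as np
from sklearn.linear_model import Lasso

def wsrc_code(A, y, s=0.5, alpha=0.01):
    """Return the weighted sparse coding coefficients x_hat for sample y."""
    # Distance-based locality adaptor: one weight per training sample.
    dists = np.linalg.norm(A - y[:, None], axis=0) ** s
    w_inv = 1.0 / np.maximum(dists, 1e-12)      # inverse of diag(W)
    A_scaled = A * w_inv[None, :]               # columns of A W^{-1}
    solver = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    solver.fit(A_scaled, y)
    z_hat = solver.coef_
    return w_inv * z_hat                        # x_hat = W^{-1} z_hat
```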

4 New Weighted Sparse Representation-based Classification

In WSRC, Lu et al. [13] used the distance between test samples and training samples, together with a locality adaptor parameter, to form the weight matrix. In this way they exploit data locality while seeking the sparse linear representation. However, local features are not extracted effectively.

When it comes to local feature extraction, features extracted by the local binary pattern are highly discriminative, which makes them suitable for image classification tasks. Therefore, in this paper we use the similarity of LBP features between test samples and training samples to form the weight matrix. Since the wavelet transform offers space-frequency localization and multi-resolution analysis, we use it to obtain multi-scale LBP features of face images.

The wavelet transform is introduced in our method to preprocess the face images: it reduces image noise, and its low-frequency component is a coarser approximation of the original image, which makes the wavelet image more suitable for recognition. A minimal decomposition sketch is given below.
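This is a minimal sketch of extracting the low-frequency approximations used in the following steps, assuming the PyWavelets package and the coif4 basis mentioned in Sect. 5; only the approximation (LL) sub-bands are kept.

```python
# Extract the 1-level and 2-level low-frequency (LL) components of a face
# image with a 2D discrete wavelet transform (PyWavelets, coif4 basis).
import pywt

def wavelet_approximations(image, wavelet="coif4"):
    """Return the 1-level and 2-level LL approximation sub-bands."""
    ll1, _ = pywt.dwt2(image, wavelet)   # 1-level approximation
    ll2, _ = pywt.dwt2(ll1, wavelet)     # 2-level approximation
    return ll1, ll2
```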

The LBP features extracted from the low-frequency component of the 1-level wavelet transform capture meaningful local and global information and contribute greatly to face recognition; for this purpose the face image is divided into m × n blocks. The LBP features extracted from the low-frequency component of the 2-level wavelet transform mainly capture global information. Concatenating the LBP features of the two low-frequency components yields the multi-scale LBP feature of a face image.

Given all that, the procedure of the proposed method, WSRC-MSLBP, is as follows (a code sketch of the full pipeline is given after the list):

  1. Apply the wavelet transform to the training samples and obtain the 1-level and 2-level low-frequency components.

  2. Divide the 1-level and 2-level low-frequency components into small blocks, then extract LBP features from each block.

  3. Concatenate the LBP features of all the blocks to form the multi-scale LBP feature of the original face image.

  4. Process the input test sample according to steps 1–3 to obtain its multi-scale LBP feature.

  5. Use the χ2 statistic, i.e. \( \chi^{2} \left( {S,M} \right) = \sum\nolimits_{k} {\frac{{\left( {S_{k} - M_{k} } \right)^{2} }}{{S_{k} + M_{k} }}} \) (where S denotes a test sample and M the model), to measure the similarity of histograms between the test sample and the training samples, and form the weight matrix W in the weighted l 1-minimization problem (10).

  6. Solve the weighted \( l_{s}^{1} \) problem (10) to obtain the coding coefficient vector \( \hat{x} \) of the test sample y.

  7. Calculate the reconstruction error \( r_{i} (y) = \left\| {y - A\delta_{i} (\hat{x})} \right\|_{2} \) for each class i, then assign the test sample y to the class that yields the smallest reconstruction error.
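Putting the pieces together, the sketch below outlines the WSRC-MSLBP pipeline, reusing the helper functions sketched earlier (wavelet_approximations, lbp_histogram, chi_square). It assumes that the χ2 distances between multi-scale LBP histograms define diag(W), while the sparse coding itself runs on the (possibly dimension-reduced) image features in A, and it again uses the Lasso approximation; all parameter values are illustrative.

```python
# End-to-end sketch of WSRC-MSLBP. Assumes wavelet_approximations,
# lbp_histogram and chi_square from the earlier sketches are in scope.
import numpy as np
from sklearn.linear_model import Lasso

def multiscale_lbp(image, rows=8, cols=8):
    """Steps 1-3: multi-scale LBP feature of one face image."""
    ll1, ll2 = wavelet_approximations(image)
    return np.concatenate([lbp_histogram(ll1, rows, cols),
                           lbp_histogram(ll2, rows, cols)])

def wsrc_mslbp_classify(train_images, A, labels, test_image, y, alpha=0.01):
    """A: feature matrix (one column per training image); y: test feature."""
    # Step 5: chi-square distances between multi-scale LBP histograms -> diag(W).
    m_test = multiscale_lbp(test_image)
    w = np.array([chi_square(m_test, multiscale_lbp(img)) for img in train_images])
    w_inv = 1.0 / np.maximum(w, 1e-12)
    # Step 6: solve the weighted l1 problem via the substitution z = W x,
    # approximated with the Lasso as in the earlier sketches.
    solver = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    solver.fit(A * w_inv[None, :], y)
    x_hat = w_inv * solver.coef_
    # Step 7: classify by the minimum class-wise reconstruction residual.
    residuals = {c: np.linalg.norm(y - A @ np.where(labels == c, x_hat, 0.0))
                 for c in np.unique(labels)}
    return min(residuals, key=residuals.get)
```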

5 Experiments and Analysis

In this section, we conduct experiments on publicly available face databases. The XM2VTS and AR databases are used to compare the performance of the proposed method with its competing methods, i.e. the nearest neighbor classifier (NNC), SRC and WSRC. As in [13], we use three methods for dimensionality reduction, namely Eigenfaces [26], Fisherfaces [3] and Randomfaces [27]. We use the SPAMS package [28, 29] to solve the stable weighted l 1-minimization problem, and the wavelet basis function is coif4. In SRC, the error tolerance ε is 0.005. In WSRC, the error tolerance ε and the locality parameter s are \( 10^{-4} \) and 0.5 for XM2VTS, and \( 10^{-4} \) and 1.5 for AR. In Fisherfaces, LDA is preceded by PCA to avoid rank deficiency, and we keep the principal components that account for 90 % of the energy.

One concern about Randomfaces is its stability: for an individual trial, the selected features could be poor. To reduce this variation, when using Randomfaces we generate 5 random projection matrices, and classification is based on the minimum average residual or distance.
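A small sketch of this averaging scheme is given below, assuming a hypothetical helper class_residuals that returns the per-class residuals of a projected test sample (it could, for example, be built from the SRC sketch in Sect. 3.1); the projection dimension, normalization and seed are illustrative.

```python
# Randomfaces with residual averaging: generate several Gaussian random
# projections, classify with each, and pick the class with the smallest
# average residual. class_residuals(...) is a hypothetical helper returning
# a dict {class: residual} for the projected dictionary and test sample.
import numpy as np

def randomfaces_classify(A, labels, y, dim, n_trials=5, seed=0):
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    totals = {c: 0.0 for c in np.unique(labels)}
    for _ in range(n_trials):
        R = rng.standard_normal((dim, m))
        R /= np.linalg.norm(R, axis=1, keepdims=True)   # normalize rows
        for c, r in class_residuals(R @ A, labels, R @ y).items():
            totals[c] += r / n_trials
    return min(totals, key=totals.get)
```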

5.1 Experiments on the XM2VTS Database

The XM2VTS database is a multi-modal database consisting of video sequences of talking faces recorded for 295 subjects at one-month intervals. The data were recorded in 4 sessions, with 2 shots taken per session. From each session, two facial images were extracted to create an experimental face database with images of size 55 × 51. In our experiment, we chose a subset of the dataset consisting of 100 subjects. For each subject, four images were used for training and the rest for testing, and each face image was divided into 8 × 8 blocks when extracting the LBP features. Figure 3 shows the recognition performance of various features in conjunction with four classifiers: NNC, SRC, WSRC and the proposed method. Table 1 gives the detailed recognition accuracy of the methods considered.

Fig. 3. Curves of recognition rate by different methods versus feature dimension on the XM2VTS database.

Table 1. Recognition rate (%) of different methods on the XM2VTS database and the associated dimension of features

5.2 Experiments on the AR Database

The AR face database contains over 4,000 frontal images of 126 subjects. For each subject, 26 pictures were taken in two separate sessions. The images exhibit a wider range of facial variation, including illumination change, expressions, and facial disguises. In the experiment, we chose a subset of the dataset (with only illumination and expression changes) consisting of 20 male and 20 female subjects. For each subject, the seven images from Session 1 were used for training and the seven from Session 2 for testing, and each face image was divided into 2 × 5 blocks when extracting the multi-scale LBP features. The comparison of the competing methods is given in Table 2.

Table 2. Recognition rate (%) of different methods on the AR database and the associated dimension of features

Based on the results obtained on the XM2VTS database, we have the following observations:

  1. In lower dimensions (e.g. the first two rows of the first column in Table 1), SRC performs worse than NNC. The reason is that in lower dimensions the linear measurements are insufficient, so the sparse recovery is inaccurate, which directly hurts the performance of SRC.

  2. WSRC outperforms SRC in most cases when Eigenfaces or Randomfaces are used for dimensionality reduction, especially in lower dimensions. This indicates that in lower dimensions data locality carries more discriminative information than data linearity, and the imposed locality does boost recognition performance.

  3. It is interesting to find that when using Fisherfaces, in lower dimensions, NNC performs better than SRC and WSRC. This is because LDA maximizes the ratio of the between-class scatter to the within-class scatter, so even the simple NNC can separate data from different classes.

  4. Whatever dimensionality reduction method is used, the proposed WSRC-MSLBP significantly outperforms the other three approaches in recognition accuracy. This is because we also take data locality into consideration and, in addition, use the multi-scale LBP feature to measure the similarity between test samples and training samples. We thus better preserve this similarity while making full use of the discriminative power of LBP features.

Similar results are observed on the AR database, where the proposed WSRC-MSLBP again achieves state-of-the-art results.

6 Conclusions

In this paper, we propose a new weighted sparse representation method called WSRC-MSLBP, which uses the similarity of the multi-scale LBP features of face images to form the weight matrix. This better exploits data locality and thus boosts face recognition performance. Experiments on the XM2VTS and AR databases show the feasibility and effectiveness of the new method. However, we have not explicitly considered the case of corrupted or occluded face images, which is not uncommon in real-world situations; in future work, we will investigate a more robust method for face recognition.