1 Introduction

In recent years, micro-expressions have received increasing attention. In many situations, people hide, camouflage, or suppress their true emotions [1], producing the partial, fast facial movements known as micro-expressions. Compared with ordinary expressions, a typical feature of micro-expressions is their short duration, usually between 1/25 s and 1/3 s [2]. Micro-expressions have potential uses in many areas, such as national security, interrogation, and medical care. It should be noted that only trained observers can distinguish micro-expressions, and even after training the recognition rate is only 47% [3]. Research on micro-expression recognition is therefore of great significance.

Early research on facial expressions focused on the facial micro-expressions [4] found within macro-expressions. In recent years, spontaneous micro-expressions have attracted growing attention from researchers. Recognizing micro-expressions requires a large amount of data for training and modeling, but such data are difficult for non-professionals to collect, which is one of the main difficulties of micro-expression recognition. Commonly used spontaneous micro-expression datasets are SMIC [5] from the University of Oulu and CASME [6] and CASME2 [7] from the Chinese Academy of Sciences. The SMIC dataset consists of three subsets, HS, VIS, and NIR, captured by a high-speed camera, a normal camera, and a near-infrared camera, respectively.

The object of micro-expression processing is a video clip, and a gray-scale video clip can be regarded as a 3D volume, so many micro-expression algorithms focus on extracting 3D texture features. Local binary patterns from three orthogonal planes (LBP-TOP) [8], an extension of LBP to three-dimensional space, is widely used in micro-expression analysis. LBP-TOP has proven effective for micro-expression recognition, and many researchers have proposed improvements based on it. For example, Huang et al. proposed the Completed Local Quantized Pattern (CLQP) [9] to reduce the dimensionality of the features. Subsequently, an integral projection method based on difference images (STLBP-IP) [10] was proposed; it first computes the difference images of the micro-expression sequence and then combines integral projection with LBP to obtain the feature vector. In 2017, Huang et al. [11] proposed an RPCA-based integral projection method (STLBP-RIP) for recognizing spontaneous micro-expressions, which performs better than the earlier methods.

In this paper, we present a new algorithm, Dual-Cross Patterns with RPCA of Key Frame (DCP-RKF), for feature extraction of micro-expressions. For each video sequence, the onset and offset frames are used as reference frames, and the structural similarity index (SSIM) [12] is used to find the key frame of the sequence. Sparse information is then extracted from the key frame using RPCA, and feature extraction is performed using DCP [13].

2 Key Frame Based on Structural Similarity (SSIM)

The spatial-domain SSIM index is based on similarities of local luminance, contrast, and structure between a reference image and a distorted image. In fact, because it is a symmetric measure, it can be regarded as a similarity measure for comparing any two signals [16]. Given two images \( x \) and \( y \), the SSIM index is defined as

$$ SSIM(x,y) = \frac{{(2\mu_{x} \mu_{y} + c_{1} )(2\sigma_{xy} + c_{2} )}}{{(\mu_{x}^{2} + \mu_{y}^{2} + c_{1} )(\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2} )}} $$
(1)

where \( \mu_{x} \) and \( \mu_{y} \) are the pixel averages of the images \( x \) and \( y \), \( \sigma_{x}^{2} \) and \( \sigma_{y}^{2} \) are their variances, and \( \sigma_{xy} \) is the covariance of \( x \) and \( y \). The constants \( c_{1} \) and \( c_{2} \) maintain numerical stability when the pixel averages are close to zero. By default, \( c_{1} = (0.01L)^{2} \) and \( c_{2} = (0.03L)^{2} \), where \( L \) is the specified dynamic range of the pixel values. The SSIM value is at most 1, and it equals 1 exactly when the two images are identical.
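As a concrete illustration, the following Python sketch evaluates Eq. (1) using whole-image statistics. This global variant is an assumption made for brevity; practical SSIM implementations typically compute the index over a local sliding window and average the resulting map.

```python
import numpy as np

def ssim(x, y, L=255):
    """Global SSIM of Eq. (1), computed from whole-image statistics
    (a simplification; standard SSIM uses a local sliding window)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # covariance sigma_xy
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```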

The pioneering work by Wang et al. [12] showed that SSIM-motivated optimization plays an important role in video coding and processing, which makes it highly relevant to micro-expression recognition.

For a micro-expression video sequence, traditional feature extraction methods consider the entire sequence or a portion of it. Micro-expression databases always suffer from alignment, lighting, and similar problems, so too much data can be the bane of accurate recognition [14]. This paper presents a novel proposition: we utilize only one image per video, called the key frame. The key frame of a video contains the highest intensity of expression change among all frames, while the onset and offset frames are natural choices of reference frames with neutral expressions; SSIM is used to extract the key frame.

Given a micro-expression video sequence \( \{ f_{i} \mid i = 1, \ldots ,n\} \), let \( R_{1} \) and \( R_{2} \) be the reference frames of the sequence, namely the first and last frames: \( R_{1} = f_{1} \) and \( R_{2} = f_{n} \). For each frame in the video sequence, the total SSIM is defined as:

$$ TSSIM_{i} = SSIM_{1i} + SSIM_{2i} = SSIM(f_{i} ,R_{1} ) + SSIM(f_{i} ,R_{2} ) $$
(2)

Combining Eq. (2) with the SSIM index of Eq. (1), Eq. (2) can be rewritten as:

$$ TSSIM_{i} = \frac{{(2\mu_{{f_{i} }} \mu_{{R_{1} }} + c_{1} )(2\sigma_{{f_{i} R_{1} }} + c_{2} )}}{{(\mu_{{f_{i} }}^{2} + \mu_{{R_{1} }}^{2} + c_{1} )(\sigma_{{f_{i} }}^{2} + \sigma_{{R_{1} }}^{2} + c_{2} )}} + \frac{{(2\mu_{{f_{i} }} \mu_{{R_{2} }} + c_{1} )(2\sigma_{{f_{i} R_{2} }} + c_{2} )}}{{(\mu_{{f_{i} }}^{2} + \mu_{{R_{2} }}^{2} + c_{1} )(\sigma_{{f_{i} }}^{2} + \sigma_{{R_{2} }}^{2} + c_{2} )}} $$
(3)

where \( i = 2, 3, \ldots ,n - 1 \). According to this definition, TSSIM is computed for every frame except the first and the last. The key frame is then obtained by comparing the TSSIM values: we take the frame with the smallest TSSIM, i.e. the frame that differs most from the two reference frames, as the key frame:

$$ keyframe = f_{i^{*}} ,\quad i^{*} = \mathop{\arg \min }\limits_{i} \,TSSIM_{i} $$
(4)
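A minimal sketch of the key-frame selection of Eqs. (2)-(4), reusing the `ssim` function sketched above; the frame list, its ordering, and the dynamic range `L` are assumptions of this illustration.

```python
import numpy as np

def select_key_frame(frames, L=255):
    """Select the key frame per Eq. (4): the interior frame whose total
    SSIM against the onset (R1 = frames[0]) and offset (R2 = frames[-1])
    reference frames is smallest, i.e. the frame differing most from both."""
    r1, r2 = frames[0], frames[-1]
    tssim = [ssim(f, r1, L) + ssim(f, r2, L)  # Eq. (2) for i = 2..n-1
             for f in frames[1:-1]]
    i_star = int(np.argmin(tssim)) + 1        # +1 skips the onset frame
    return frames[i_star], i_star
```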

3 Sparse Information Extracted from Key Frame Using RPCA

Although the key frame selected by structural similarity preserves the main discriminative information of the different micro-expressions, it still contains a large amount of static facial information. Just as STLBP-IP [10] characterizes different micro-expressions well by processing video clips with a difference-image method, we follow the same idea and extract the motion features via robust principal component analysis (RPCA).

Based on Eq. (4), we can obtain the key frame of any video sequence; for convenience, we denote it as \( M \). Since \( M \) is a large data matrix whose data are characterized by low-rank subspaces, it may be decomposed as

$$ M = L_{0} + S_{0} $$
(5)

where \( L_{0} \) is a low-rank matrix and \( S_{0} \) is a sparse matrix; our aim is to recover \( S_{0} \). This problem can be solved by tractable convex optimization, and Eq. (5) is formulated as follows

$$ \mathop{\min }\limits_{L,S} \,||L||_{*} + \lambda ||S||_{1} \quad {\text{subject to}}\quad L + S = M $$
(6)

where \( ||\cdot||_{*} \) denotes the nuclear norm of a matrix, i.e. the sum of its singular values, and \( \lambda \) is a positive weighting parameter. An iterative thresholding technique can minimize this combination of the \( \ell_{1} \) norm and the nuclear norm, but such a scheme converges very slowly.

We therefore turn to the Augmented Lagrange Multiplier (ALM) method, which operates on the augmented Lagrangian

$$ l(L,S,Y) = ||L||_{*} + \lambda ||S||_{1} + \langle Y,M - L - S\rangle + \frac{\mu }{2}||M - L - S||_{F}^{2} $$
(7)

where \( Y \) is a Lagrange multiplier matrix and \( \mu \) is a positive scalar. A generic Lagrange multiplier algorithm solves PCP (principal component pursuit) by repeatedly setting \( (L_{k} ,S_{k} ) = \arg \min_{L,S} \,l(L,S,Y_{k} ) \) and then updating the multiplier matrix via \( Y_{k + 1} = Y_{k} + \mu (M - L_{k} - S_{k} ) \). Equation (7) can be solved by the ALM scheme proposed by Candès et al. [15].

Figure 1 shows the key frame selected from a micro-expression video clip labeled as negative, together with the sparse part extracted by RPCA. After applying RPCA, the sparse component that we use for feature extraction carries far less redundant information, which also reflects the simplicity of the proposed method. As seen from Fig. 1, the subtle motion image obtained by RPCA well characterizes the specific regions of facial movement.

Fig. 1. Sparse information extracted from key frame using RPCA
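The following sketch implements the inexact ALM iteration for PCP described above. The parameter choices (\( \lambda = 1/\sqrt{\max (m,n)} \), the initialization of \( Y \) and \( \mu \), and \( \rho = 1.5 \)) follow common defaults from the RPCA literature and are assumptions here, not the paper's exact settings.

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding (shrinkage) operator for the l1 term."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value thresholding operator for the nuclear-norm term."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca_ialm(M, lam=None, tol=1e-7, max_iter=500):
    """Inexact-ALM sketch for PCP (Eqs. (6)-(7)):
    min ||L||_* + lam * ||S||_1  s.t.  L + S = M."""
    M = M.astype(np.float64)
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))   # standard weighting parameter
    norm_M = np.linalg.norm(M, 'fro')
    spec = np.linalg.norm(M, 2)          # spectral norm of M
    Y = M / max(spec, np.abs(M).max() / lam)  # common dual initialization
    mu, rho = 1.25 / spec, 1.5
    S = np.zeros_like(M)
    for _ in range(max_iter):
        L = svd_threshold(M - S + Y / mu, 1.0 / mu)  # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)         # sparse update
        R = M - L - S
        Y = Y + mu * R                               # dual ascent step
        mu = min(mu * rho, 1e7)
        if np.linalg.norm(R, 'fro') / norm_M < tol:
            break
    return L, S
```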

4 Dual-Cross Patterns (DCP)

DCP is a local binary descriptor built on local sampling and pattern encoding, the two essential components of a face image descriptor. DCP encodes the second-order statistical information along the most informative directions of a face image. The research of Ding et al. [13] shows that DCP has strong discriminative ability and is robust to changes in pose, expression, and illumination. Compared to LBP, DCP uses a different local sampling scheme, as shown in Fig. 2.

The purpose of DCP is to perform local sampling and pattern encoding along the directions of a face image that carry the most information. After a face image is normalized, facial components such as the eyes, nose, mouth, and eyebrows extend either horizontally or along the diagonal directions (\( \pi /4 \) and \( 3\pi /4 \)). As shown in Fig. 2(a), for each pixel \( O \) in the image, sampling is performed in 8 directions: 0, \( \pi /4 \), \( \pi /2 \), \( 3\pi /4 \), \( \pi \), \( 5\pi /4 \), \( 3\pi /2 \), and \( 7\pi /4 \), with two pixels sampled in each direction. The resulting sampling points are \( \{ A_{0} ,B_{0} ;A_{1} ,B_{1} ; \ldots ;A_{7} ,B_{7} \} \), where the points \( A_{i} \) lie at radius \( R_{in} \) and the points \( B_{i} \) at radius \( R_{ex} \).

The encoding for each direction is defined as follows

$$ DCP_{i} = S(I_{{A_{i} }} - I_{O} ) \times 2 + S(I_{{B_{i} }} - I_{{A_{i} }} ),0 \le i \le 7 $$
(8)

where \( S(t) = \begin{cases} 1, & t \ge 0 \\ 0, & t < 0 \end{cases} \), and \( I_{O} \), \( I_{{A_{i} }} \), \( I_{{B_{i} }} \) are the gray values of the points \( O \), \( A_{i} \), and \( B_{i} \), respectively.

Fig. 2. Local sampling of DCP

To capture the horizontal and diagonal information of the image separately, the eight \( DCP_{i} \) codes are divided between two cross encoders. We define \( \left\{ {DCP_{0} ,DCP_{2} ,DCP_{4} ,DCP_{6} } \right\} \) as the first subset, named DCP-1, and \( \left\{ {DCP_{1} ,DCP_{3} ,DCP_{5} ,DCP_{7} } \right\} \) as the second subset, named DCP-2, as shown in Fig. 2(b). The codes at each pixel are represented as

$$ DCP\text{-}1 = \sum\limits_{i = 0}^{3} {DCP_{2i} \times 4^{i} } $$
(9)
$$ DCP\text{-}2 = \sum\limits_{i = 0}^{3} {DCP_{2i + 1} \times 4^{i} } $$
(10)

Thus, the DCP descriptor for each pixel in an image can be represented by the two codes generated by the cross encoders.
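A compact sketch of the DCP encoding of Eqs. (8)-(10). Rounding the diagonal offsets to integer pixel positions is a simplification of this sketch; an exact implementation might interpolate the sampled gray values.

```python
import numpy as np

def dcp_codes(img, r_in=4, r_ex=9):
    """Compute the DCP-1 / DCP-2 code maps (Eqs. (8)-(10)) for every pixel
    of a gray-scale image far enough from the border to sample at r_ex."""
    img = img.astype(np.int32)
    h, w = img.shape
    r = r_ex
    center = img[r:h - r, r:w - r]           # the pixels O
    dcp1 = np.zeros_like(center)
    dcp2 = np.zeros_like(center)
    for k in range(8):                       # directions 0, pi/4, ..., 7*pi/4
        theta = k * np.pi / 4
        dy_a, dx_a = (int(round(r_in * np.sin(theta))),
                      int(round(r_in * np.cos(theta))))
        dy_b, dx_b = (int(round(r_ex * np.sin(theta))),
                      int(round(r_ex * np.cos(theta))))
        A = img[r + dy_a:h - r + dy_a, r + dx_a:w - r + dx_a]  # points A_i
        B = img[r + dy_b:h - r + dy_b, r + dx_b:w - r + dx_b]  # points B_i
        code = 2 * (A >= center) + (B >= A)  # Eq. (8): DCP_i in {0,1,2,3}
        if k % 2 == 0:
            dcp1 += code * 4 ** (k // 2)     # Eq. (9): DCP-1 from even i
        else:
            dcp2 += code * 4 ** (k // 2)     # Eq. (10): DCP-2 from odd i
    return dcp1, dcp2
```

Calling `dcp_codes(img, 4, 9)` returns two code maps with values in [0, 255], one per cross encoder.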

5 Results and Discussion

To evaluate DCP-RKF, experiments are conducted on the SMIC-HS database for micro-expression recognition [10]. The SMIC-HS database consists of 164 spontaneous micro-expression samples from 16 subjects, recorded by a 100-fps camera at a spatial resolution of 640 × 480 pixels. The database contains 3 classes of micro-expressions: negative (70 samples), positive (51 samples), and surprise (43 samples).

For the SMIC-HS database, we first use an active shape model (ASM) to extract 68 facial landmarks from each micro-expression image and align the faces to a standard frame. The facial images are then cropped to 170 × 139 pixels. In the experiments, we use the leave-one-sample-out cross-validation protocol, in which one sample is used for testing and the remaining samples are used for training. For classification, we use the chi-square distance.
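As a sketch of this evaluation protocol, the following hypothetical helper performs leave-one-sample-out validation with a nearest-neighbor rule under the chi-square distance; the nearest-neighbor rule is an assumption, since the text only states that the chi-square distance is used for classification.

```python
import numpy as np

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two (L1-normalized) histogram features."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def classify_loso(features, labels):
    """Leave-one-sample-out accuracy with a 1-NN chi-square classifier."""
    n = len(features)
    correct = 0
    for i in range(n):
        others = [j for j in range(n) if j != i]      # training samples
        dists = [chi_square(features[i], features[j]) for j in others]
        pred = labels[others[int(np.argmin(dists))]]  # nearest neighbor
        correct += int(pred == labels[i])
    return correct / n
```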

We now examine the parameters of Dual-Cross Patterns with RPCA of Key Frame (DCP-RKF). The block size \( N \) of the sparse key frame and the inner and outer radii \( (R_{in} ,R_{ex} ) \) of DCP are two important parameters of DCP-RKF, which determine the complexity of the algorithm and the performance of the classification. In this subsection, we evaluate the effects of \( N \) and \( (R_{in} ,R_{ex} ) \), starting with the performance of DCP-RKF for various \( N \) on the SMIC-HS database. The number of blocks of a sparse key frame is given by the number of rows and the number of columns, defined as \( N = (row,col) \). To avoid bias and compare the features at a more general level, we extract features while varying the number of blocks and keeping the DCP radii fixed at \( (R_{in} ,R_{ex} ) = (5,7) \). The results of DCP-RKF on the SMIC-HS database are presented in Fig. 3.
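Concretely, a small sketch of the block-wise feature construction just described, assuming (as is common for LBP-style descriptors) that per-block histograms of the DCP code maps are concatenated into the final feature vector; the bin count and per-block normalization are assumptions of this sketch.

```python
import numpy as np

def block_histograms(code_map, n_blocks=(8, 9), n_bins=256):
    """Concatenate per-block histograms of a DCP code map; the (row, col)
    grid corresponds to the block parameter N = (row, col) in the text."""
    rows, cols = n_blocks
    h, w = code_map.shape
    feats = []
    for bi in np.array_split(np.arange(h), rows):
        for bj in np.array_split(np.arange(w), cols):
            block = code_map[np.ix_(bi, bj)]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(block.size, 1))  # L1-normalize per block
    return np.concatenate(feats)
```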

Fig. 3. Performance of DCP-RKF using different blocks of sparse key frame on the SMIC-HS database (%)

Figure 3 shows that, with the DCP radii fixed at (5, 7), the DCP-RKF recognition rate reaches its maximum of 62.8% when the blocks are (8, 9) or (10, 10). With blocks of (1, 1), i.e. without blocking, the overall recognition rate is lower, which indicates that blocking helps the recognition rate; theoretically, we believe blocking preserves the positional information of the micro-expressions. Overall, within a certain range, the recognition rate tends to rise as the number of blocks increases; beyond that range, however, further increasing the number of blocks does not improve the recognition rate.

Based on the chosen \( N = (8,9) \), we examine the influence of \( (R_{in} ,R_{ex} ) \), with \( R_{in} ,R_{ex} \in \{ 1,2, \ldots ,9\} \) and \( R_{in} < R_{ex} \), on the SMIC-HS database; the results are shown in Table 1. From Table 1 we can see that, when the sparse key frame is divided into 8 × 9 blocks, the micro-expression recognition rate of DCP-RKF depends on the DCP radii. For DCP-RKF, the greater the difference between \( R_{in} \) and \( R_{ex} \), the better the recognition we obtain; however, if the radii differ too much, the performance may drop. With 8 × 9 blocks, the best recognition rate of 63.41% is obtained at radii (4, 9).

Table 1. Recognition rates with different radius of DCP under the 8 × 9 blocks (%)

To verify the proposed method, we compare the recognition rate of DCP-RKF with those of LBP-TOP [8], STLBP-IP [10], and STLBP-RIP [11] on the SMIC-HS database. It should be noted that DCP-1-RKF and DCP-2-RKF denote variants that use only DCP-1 or only DCP-2, respectively, for feature extraction from the sparse key frame.

For DCP-RKF, we use 8 × 9 blocks on the sparse key frame and DCP radii of (4, 9). The leave-one-sample-out cross-validation protocol is used to split training and testing samples, and the chi-square distance is used for classification. For LBP-TOP, STLBP-IP, and STLBP-RIP, to keep the comparison fair, we use the same 8 × 9 blocks, the optimal parameters proposed in the respective papers, and the same classification method, re-implementing each algorithm on the SMIC-HS database. The recognition rates are reported in Table 2. As the table shows, LBP-TOP achieves a recognition rate of 55.49%, while STLBP-IP and STLBP-RIP only reach 50%. DCP-RKF attains the best recognition rate of 63.41%, which is 7.92% higher than LBP-TOP. These results show that DCP-RKF captures useful geometric and texture information and is well suited to micro-expression feature extraction.

Table 2. Micro-expression recognition rates of different methods (%)

The confusion matrices of LBP-TOP, STLBP-IP, STLBP-RIP, and our method are shown in Fig. 4. Compared with the other methods, DCP-RKF performs better on all three emotion classes (negative, positive, surprise). On negative micro-expressions, DCP-RKF achieves a recognition rate of 67.14%, higher than the 62.86% of LBP-TOP, 50% of STLBP-IP, and 60% of STLBP-RIP. Similarly, on positive and surprise micro-expressions, DCP-RKF achieves 64.7% and 55.81%, respectively, also higher than the other three methods.

Fig. 4. The confusion matrices of (a) LBP-TOP, (b) STLBP-IP, (c) STLBP-RIP, (d) DCP-RKF for micro-expression recognition on the SMIC-HS database

6 Conclusions

In this paper, we proposed Dual-Cross Patterns with RPCA of Key Frame (DCP-RKF) for micro-expression recognition. Specifically, we first use SSIM to obtain the key frame of a micro-expression sequence, then apply RPCA to obtain the sparse information of the key frame, and finally use DCP to extract the features. Experimental results on the SMIC-HS micro-expression database demonstrate that the proposed method achieves higher recognition rates than state-of-the-art methods and delivers promising performance.