
1 Introduction

In recent years, human facial expression recognition (FER) has emerged as an important research area, because facial expressions can effectively represent the emotional state, cognitive activities and personality characteristics of human beings. Thus, facial expression recognition has been widely used in computer vision applications, such as human computer interaction (HCI) [1], psychology and cognitive science [2], access control and surveillance systems [3], and driver state surveillance.

Early research on facial expression recognition mainly focused on recognizing expressions from a static image or recognizing video sequences by analyzing each frame. For static image based facial expression recognition, Gabor wavelets [4] and local binary patterns (LBP) [5] are usually used to explore the texture information in face images to represent the facial expression. Moreover, the active shape model (ASM) [6] and the active appearance model (AAM) [7] are commonly used to extract facial landmarks that describe the changes of facial expression. Static image based methods can effectively extract texture and spatial information from the image, but they cannot model the variability in morphological and contextual factors. Therefore, some studies try to capture the dynamic variation of the facial physical structure by exploring the spatial-temporal information in video sequences, using descriptors such as 3D-HOG [8], LBP-TOP [9] and 3D-SIFT [10]. Because video sequences of facial expressions contain not only image appearance information in the spatial domain but also evolution details in the temporal domain, the facial appearance information, together with the expression evolution information, can further enhance recognition performance. More recently, deep learning, and in particular convolutional neural networks (CNNs), has shown its strength in computer vision applications and has also been widely used to solve the FER problem. For example, Yu et al. [11] utilize an ensemble of CNNs and employ data augmentation at both the training and testing stages in order to improve the performance of FER. Jung et al. [12] propose a small CNN architecture to capture the dynamic variations of facial appearance.

Fig. 1. The framework of our method for FER.

Inspired by the advantages of deep convolutional neural networks and spatial-temporal video representations, we design a static pyramid CNN-based feature generated on the apex frame and further combine it with a dynamic appearance-based LBP-TOP feature extracted from the video sequence to enhance the performance of FER. The overall flow diagram of the proposed method is shown in Fig. 1. First, according to the prior knowledge that the video frame with the largest expression intensity plays an important role in FER, the expression intensity of each frame is estimated by calculating the displacement of facial landmarks, and the frame with the maximum displacement is selected as the apex frame. Second, making use of the superior representation capability of deep convolutional features in a pre-trained CNN architecture, a pyramid CNN-based feature representation is generated on the apex frame to capture information on both global and local regions of the human face. Afterwards, a dynamic LBP-TOP feature is extracted to model the spatial-temporal information of the whole video sequence. Finally, the multi-feature representation formed by combining the static and dynamic features is fed into a classifier to accomplish FER. The main contributions of this paper are twofold. First, in order to capture the possible slight asymmetry between the left and right sides of the face as well as the subtle motion of local facial regions when the facial expression changes, we propose to construct a two-level image pyramid on the apex frame and extract deep convolutional features from each region of the pyramid to boost the representation capability for facial expressions. Second, the static image feature representation based on the apex frame and the dynamic feature representation based on the spatial-temporal information in the video sequence are combined to further improve the performance of facial expression recognition.

The rest of the paper is organized as follows. Section 2 gives a detailed description of the proposed pyramid CNN-based feature generation and the LBP-TOP feature. The experimental results and discussions are presented in Sect. 3. Conclusions are given in Sect. 4.

2 Methodology

The proposed FER system consists of three procedures: face preprocessing (face detection, face registration and facial expression intensity estimation), static pyramid CNN-based feature generation on the apex frame, and LBP-TOP feature extraction on the video sequence. The details of each procedure are described as follows.

In the preprocessing stage of the FER system, the face region is first detected and cropped in each frame to eliminate interference from unnecessary noise. Following the usual protocol, the Viola-Jones face detector [13] is used to detect the face region in each video frame. To further improve the accuracy of the detected face region, the method in [14] is employed to detect the facial landmarks, and the outermost facial landmarks are selected to determine the boundary of the final face region.

Once the final accurate face region of each frame in the video sequence is obtained, the face registration technique [15] is adopted to remove the influence of scale, rotation and translation changes of the face region. Consequently, the difference in facial expression between two frames is confined to the changes of the facial muscles, and the registered face image also provides an optimal input for subsequent feature extraction.
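To make the preprocessing step concrete, the following is a minimal sketch of face detection and cropping, assuming OpenCV's Haar-cascade implementation of the Viola-Jones detector; the landmark-based boundary refinement [14] and the registration step [15] are only indicated in comments, since they rely on components not covered by this sketch.

```python
import cv2

def detect_and_crop_face(frame_gray, cascade):
    """Detect the largest face with a Viola-Jones (Haar cascade) detector and crop it."""
    faces = cascade.detectMultiScale(frame_gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detection, assuming it corresponds to the subject's face.
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    return frame_gray[y:y + h, x:x + w]

# Usage sketch: OpenCV ships this cascade file with its data package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # one video frame
# face = detect_and_crop_face(frame, cascade)
# The crop would then be refined with the detected landmarks [14] and registered
# (e.g., by a similarity transform) before feature extraction [15].
```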

2.1 Static Pyramid CNN-based Feature

As is well known, the frame with the largest expression intensity in a video sequence contains rich discriminative expression information, and it is usually termed the apex frame. Based on this observation, we select the apex frame from the video sequence and generate a feature representation for it to improve the performance of FER. In this paper, we propose to estimate the facial expression intensity by calculating the displacement of facial landmarks and to select the frame with the maximum displacement as the apex frame. This procedure is adaptive: it applies not only to expression videos whose intensity evolves as neutral-onset-apex, as in traditional datasets, but also to videos with a neutral-onset-apex-offset-neutral intensity transformation. Inspired by the work in [14], which proposes a machine learning approach called the supervised descent method (SDM) to detect facial landmark positions with high accuracy, we adopt SDM to detect the facial landmarks in our method. Under the assumption that the facial expression evolves as neutral-onset-apex or neutral-onset-apex-offset-neutral, the first frame in the video sequence can be treated as the neutral facial expression. Therefore, letting \(X_{1}^{i}\) and \(Y_{1}^{i}\) denote the coordinates of the \({{i}^{th}}\) landmark in the first frame, and \(X_{t}^{i}\) and \(Y_{t}^{i}\) the coordinates of the \({{i}^{th}}\) landmark in the \({{t}^{th}}\) frame, the landmark displacement \(D_{t}\) between the first frame and the \({{t}^{th}}\) frame can be calculated as:

$$\begin{aligned} {{D}_{t}}=\sum \limits _{i=1}^{n}{|X_{t}^{i}-X_{1}^{i}|}+\sum \limits _{i=1}^{n}{|Y_{t}^{i}-Y_{1}^{i}|} \end{aligned}$$
(1)

where n denotes the number of landmarks detected by SDM, which is 66 as commonly used. Thereby, the frame with the maximum value of \(D_{t}\) is chosen as the apex frame. Figure 2 demonstrates the procedure of selecting the apex frame in a video sequence.

Fig. 2. The procedure of apex frame selection.
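The apex frame selection of Eq. (1) can be sketched as follows. The sketch assumes that the 66 landmarks of every frame have already been produced by an SDM-style detector [14]; the array `landmark_seq` stands in for that detector's output.

```python
import numpy as np

def select_apex_frame(landmark_seq):
    """Select the apex frame index according to Eq. (1).

    landmark_seq: array of shape (T, n, 2) with the (x, y) coordinates of the
    n facial landmarks (n = 66 here) detected in each of the T frames.
    """
    first = landmark_seq[0]                        # landmarks of the neutral first frame
    # D_t = sum_i |x_t^i - x_1^i| + sum_i |y_t^i - y_1^i|
    displacements = np.abs(landmark_seq - first).sum(axis=(1, 2))
    return int(np.argmax(displacements))           # index of the apex frame

# Usage sketch (random coordinates stand in for real landmark detections):
# landmark_seq = np.random.rand(30, 66, 2)
# apex_idx = select_apex_frame(landmark_seq)
```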

As mentioned above, as the facial expression changes, the left and right sides of the face may be asymmetric in certain frames, especially in the eye and mouth regions, so it is necessary to analyze the local regions of the face in the apex frame. We propose to construct a two-level image pyramid on the apex frame and generate deep convolutional features on the image pyramid to represent it, so that both the global and local information of the face is captured to enhance the discrimination capability of the static apex frame. The proposed pyramid CNN-based face representation is established at two scale levels. The first level corresponds to the full apex face frame, and the second level consists of 4 regions obtained by equally partitioning the full face region. Therefore, we obtain five deep features by passing each region through a pre-trained CNN architecture: \({C}_{0}\) denotes the deep feature from the first level, and \({C}_{1}\), \({C}_{2}\), \({C}_{3}\), \({C}_{4}\) denote the deep features from the second level. Afterwards, we concatenate the five deep features as C = [\({C}_{0}\), \({C}_{1}\), \({C}_{2}\), \({C}_{3}\), \({C}_{4}\)], and the final deep face representation C has a dimension of 5\(*\)512. The static pyramid CNN-based feature extraction process is shown in Fig. 3.

Fig. 3. Feature extraction procedure based on the pyramid CNN model.
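A minimal sketch of the two-level pyramid construction and feature concatenation is given below. The helper `extract_conv_feature`, which maps a face region to a 512-dimensional deep convolutional feature, is an assumption here and is sketched after Eqs. (2)-(3).

```python
import numpy as np

def pyramid_regions(face):
    """Two-level pyramid: the full apex face plus its four equal quadrants."""
    h, w = face.shape[:2]
    return [
        face,                                               # level 1: C_0
        face[:h // 2, :w // 2], face[:h // 2, w // 2:],     # level 2: C_1, C_2
        face[h // 2:, :w // 2], face[h // 2:, w // 2:],     #          C_3, C_4
    ]

def pyramid_cnn_feature(face, extract_conv_feature):
    """Concatenate the five 512-d deep features into C = [C_0, ..., C_4]."""
    feats = [extract_conv_feature(region) for region in pyramid_regions(face)]
    return np.concatenate(feats)                            # 5 * 512 = 2560 dimensions
```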

For the deep features, we propose to use the deep convolutional representation rather than the conventional outputs of the fully connected layers of the CNN. Given a pre-trained CNN model with L convolutional layers, we can extract the feature maps of an input image after resizing it to 224\(*\)224 for the VGG [18] network. The feature maps can be denoted by \(\bar{F}=\left\{ F_{i,j}:i=1,\ldots ,L;\ j=1,\ldots ,C_{i}\right\} \), where \(F_{i,j}\) is the \({{j}^{th}}\) feature map at the \({{i}^{th}}\) convolutional layer and \(C_{i}\) is the number of convolutional kernels at that layer. The size of \(F_{i,j}\) is \(W_{i}\) \(\times \) \(H_{i}\), where \(W_{i}\) and \(H_{i}\) are the width and height of each channel. Let (x, y) be a spatial coordinate of the feature map \(F_{i,j}\), and let \(f_{i,j}(x,y)\) be the response value of \(F_{i,j}\) at (x, y). Then, the image representation obtained by max-pooling can be described as follows:

$$\begin{aligned} \dot{V}_{i}=[\dot{V}_{F_{i,j}}:j=1...C_{i}] \end{aligned}$$
(2)
$$\begin{aligned} \dot{V}_{F_{i,j}}=\max \limits _{(x,y)}f_{i,j}(x,y) \end{aligned}$$
(3)
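The sketch below illustrates Eqs. (2)-(3) on the last convolutional layer of a pre-trained VGG-16, assuming the torchvision implementation; each of the 512 feature maps is max-pooled over its spatial coordinates, yielding a 512-dimensional vector per face region. The ImageNet normalization constants and the assumption of an RGB input crop are choices of this sketch, not specifications of the original experiments.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained VGG-16; `features` holds the convolutional (and pooling) layers only.
vgg = models.vgg16(pretrained=True).features.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_conv_feature(region_rgb):
    """Eqs. (2)-(3): spatial max-pooling of the 512 last-layer feature maps."""
    x = preprocess(region_rgb).unsqueeze(0)           # (1, 3, 224, 224)
    with torch.no_grad():
        fmaps = vgg(x)                                # (1, 512, 7, 7)
    return fmaps.amax(dim=(2, 3)).squeeze(0).numpy()  # 512-d vector
```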

2.2 LBP-TOP Feature

Local binary patterns from three orthogonal planes (LBP-TOP) [9] is an extension of LBP from two-dimensional space to three-dimensional space: LBP-TOP extracts local binary pattern features from the three orthogonal planes (i.e., XY, XT and YT) of a video sequence. Compared with LBP, LBP-TOP not only contains the texture information of the XY plane, but also takes into account the texture information of the XT and YT planes, which records important dynamic textures. For each plane, a histogram of dynamic texture can be defined as:

$$\begin{aligned} {{H}_{i,j}}=\sum \nolimits _{x,y,t}{I\left\{ {{f}_{j}}\left( x,y,t \right) =i \right\} } \end{aligned}$$
(4)
$$\begin{aligned} {i}=0,\cdots ,{{n}_{j}}-1;\ j=0,1,2 \end{aligned}$$

where \(n_{j}\) is the number of different labels produced by the LBP operator in the \({{j}^{th}}\) plane (j = 0: XY, 1: XT and 2: YT), \({{f}_{j}}\left( x,y,t \right) \) denotes the LBP code of the central pixel (x, y, t) in the \({{j}^{th}}\) plane, and \(I\{A\}=1\) if A is true, \(I\{A\}=0\) otherwise. Afterwards, the statistical histograms of the three planes are concatenated into one histogram. Furthermore, to account for the motion of different face regions, a block-based scheme is introduced, which cascades the histograms extracted from all block volumes. In the experiments, each sequence volume is divided into 8\(*\)8 non-overlapping blocks. The procedure of extracting block-based LBP-TOP features is shown in Fig. 4.

Fig. 4. Feature extraction procedure based on block-based LBP-TOP.
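A simplified sketch of the block-based LBP-TOP feature is given below. It assumes scikit-image's 2D LBP operator applied independently to the slices of the three orthogonal planes, which approximates rather than exactly reproduces the original LBP-TOP operator [9]; the uniform-pattern label count is likewise an assumption of this sketch.

```python
import numpy as np
from skimage.feature import local_binary_pattern

P, R = 8, 1                     # 8 neighbors, radius 1
N_LABELS = P * (P - 1) + 3      # 59 labels for non-rotation-invariant uniform LBP

def plane_histogram(slices):
    """Sum the LBP histograms of the 2D slices of one plane, following Eq. (4)."""
    hist = np.zeros(N_LABELS)
    for s in slices:
        codes = local_binary_pattern(s, P, R, method="nri_uniform").ravel()
        hist += np.bincount(codes.astype(int), minlength=N_LABELS)
    return hist

def lbp_top_block(volume):
    """Simplified LBP-TOP histogram of one block volume of shape (T, H, W)."""
    xy = plane_histogram([volume[t] for t in range(volume.shape[0])])
    xt = plane_histogram([volume[:, y, :] for y in range(volume.shape[1])])
    yt = plane_histogram([volume[:, :, x] for x in range(volume.shape[2])])
    return np.concatenate([xy, xt, yt])

def lbp_top_feature(video, blocks=(8, 8)):
    """Cascade the block-wise histograms over the 8x8 non-overlapping spatial grid."""
    n_frames, height, width = video.shape
    bh, bw = height // blocks[0], width // blocks[1]
    feats = [lbp_top_block(video[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw])
             for i in range(blocks[0]) for j in range(blocks[1])]
    return np.concatenate(feats)
```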

Furthermore, the static pyramid CNN-based feature and the LBP-TOP feature are cascaded into a final face video representation for training and testing. The strength of this final representation is that it not only contains the static feature from the apex frame, which has the maximum expression intensity among the face frames, but also takes into account the spatial-temporal information of the video sequence.
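The fusion and classification step can be sketched as follows, assuming scikit-learn's SVC, whose multiclass handling is a one-versus-one decomposition as mentioned in Sect. 4; the linear kernel, the C value and the feature standardization are illustrative choices rather than the settings of the original experiments.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse(pyramid_cnn_feat, lbp_top_feat):
    """Cascade the static (5*512-d) and dynamic (block-based LBP-TOP) features."""
    return np.concatenate([pyramid_cnn_feat, lbp_top_feat])

# X_train: one fused vector per training video; y_train: expression labels (0..5).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```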

3 Experimental Results

3.1 Dataset

The extended Cohn-Kanade (CK+) dataset [16]: there are 593 frontal video sequences in total from 123 subjects. The sequences vary in duration, from 10 to 60 frames per video, starting from the neutral state and ending at the apex of the facial expression. The CK+ dataset contains 327 expression-labeled sequences covering seven expressions, but only the 309 image sequences with the six basic expressions (anger, disgust, fear, happiness, sadness, and surprise) were considered in our study.

The Oulu-CASIA dataset [17]: it consists of six expressions from 80 subjects. All the image sequences were taken under three visible light conditions: normal, weak and dark. The number of video sequences is 480 (80 subjects by six expressions) for each illumination, so there are 2880 (480\(\times \)6) video sequences in the dataset in total. All expression sequences begin at the neutral frame and end with the apex frame. In the experiments, we evaluate our method under the normal illumination condition.

3.2 Experimental Results on CK+ Dataset and Oulu-CASIA Dataset

In this part, we evaluate the proposed framework on both the CK+ and Oulu-CASIA datasets. We first test the performance of apex frame selection based on facial expression intensity estimation by calculating the facial landmark displacement. Figure 5 shows the selected apex frames from different video sequences. As the facial expression evolves from neutral to apex in both datasets, the apex frame estimated by our method is almost always the last frame of each video sequence, which proves the correctness of the apex frame selection method.

Fig. 5. Apex frames selected from (a) the CK+ dataset and (b) the Oulu-CASIA dataset.

We further evaluate the performance of the outputs of the fully connected layers and of the last convolutional layer of the CNN. Additionally, the performance of both types of deep features on a single face image as well as on the proposed two-level face image pyramid is compared. The accuracy of a specific expression is measured by the ratio of correctly recognized samples to the total number of samples of that expression, while the total accuracy is calculated as the ratio of all correctly recognized samples to the total number of testing samples. As illustrated in Fig. 6, the deep convolutional features with only 512 dimensions show competitive or even higher accuracy than the fully connected features with 4096 dimensions on both datasets. Furthermore, applying the deep features to the proposed two-level image pyramid shows that the pyramid CNN-based representation indeed improves the facial expression recognition accuracy compared with using a single face image.

Fig. 6. Comparison of recognition rates of four CNN features with different dimensions on (a) the CK+ dataset and (b) the Oulu-CASIA dataset.

Moreover, we conduct experiments to evaluate the effectiveness of combining the static pyramid CNN-based feature and the dynamic spatial-temporal LBP-TOP feature for facial expression recognition in video sequences. Tables 1, 2, 3 and 4 show the confusion matrices obtained by using only the LBP-TOP feature as well as the combination of both features with a multiclass SVM classifier. Each confusion matrix includes the recognition accuracy of each expression and the total classification accuracy. Based on the results on both datasets, the combination of the two features achieves higher total recognition accuracy and per-expression accuracy than using either the dynamic LBP-TOP feature or the static pyramid CNN-based feature alone. Especially on the CK+ dataset, the proposed framework significantly improves the performance on the anger, disgust, happiness and surprise expressions, and the recognition accuracy of the fear expression is greatly improved compared to the LBP-TOP feature.

Table 1. Confusion matrix of LBP-TOP on CK+ dataset
Table 2. Confusion matrix of LBP-TOP on Oulu-CASIA dataset
Table 3. Confusion matrix of combination feature on CK+ dataset
Table 4. Confusion matrix of combination feature on Oulu-CASIA dataset

3.3 Comparison with State-of-the-art

In the following, we compare the proposed framework with published state-of-the-art methods on each dataset. As shown in Table 5, our method achieves the highest recognition accuracy for facial expression recognition, which further proves the discriminative power and robustness of our video sequence representation combining the static pyramid CNN-based feature and the dynamic LBP-TOP feature.

Table 5. Comparison with the state-of-the-art on both datasets

4 Conclusions

In this paper, we presented a novel FER method in which static and dynamic features are integrated to boost FER performance. For the static feature extraction procedure, in order to capture the global and local information of the human face, a pyramid CNN model is constructed to extract features from apex frames, which are selected adaptively by using the displacement information of facial landmarks. Moreover, the spatial-temporal LBP-TOP feature is employed as the dynamic feature and is cascaded with the static pyramid CNN-based feature to classify expressions using a multiclass SVM with a one-versus-one strategy. The evaluation results show that our method is competitive with or even superior to the state-of-the-art methods on two facial expression datasets.