
1 Introduction

In recent years, human facial expression recognition (FER) has emerged as an important research area, because facial expressions can effectively represent the emotional state, cognitive activities and personality characteristics of human beings. Thus, facial expression recognition has been widely used in computer vision applications, such as human computer interaction (HCI) [1], psychology and cognitive science [2], access control and surveillance systems [3], and driver state surveillance.

Early research on facial expression recognition mainly focused on recognizing expressions from a static image or recognizing video sequences by analyzing each frame. For static image based facial expression recognition, Gabor wavelets [4] and local binary patterns (LBP) [5] are usually used to explore the texture information in face images to represent the facial expression. Moreover, the active shape model (ASM) [6] and the active appearance model (AAM) [7] are commonly used to extract facial landmarks that describe the changes of facial expression. Static image based methods can effectively extract texture and spatial information from the image, but they cannot model the variability in morphological and contextual factors. Therefore, some studies try to capture the dynamic variation of the facial physical structure by exploring the spatial-temporal information in video sequences, using descriptors such as 3D-HOG [8], LBP-TOP [9] and 3D-SIFT [10]. Because video sequences of facial expressions contain not only image appearance information in the spatial domain but also evolution details in the temporal domain, the facial appearance information, together with the expression evolution information, can further enhance recognition performance. More recently, deep learning, and in particular convolutional neural networks (CNNs), has shown its strength in computer vision applications and has also been widely used to solve the FER problem. For example, Yu et al. [11] utilize an ensemble of CNNs and employ data augmentation at both the training and testing stages in order to improve the performance of FER. Jung et al. [12] propose a small CNN architecture to capture the dynamic variations of facial appearance.

Fig. 1. The framework of our method for FER.

Inspired by the advantages of deep convolutional neural networks and spatial-temporal video representations, we design a static pyramid CNN-based feature generated on the apex frame and further combine it with a dynamic appearance-based LBP-TOP feature extracted from the video sequence to enhance the performance of FER. The overall flow diagram of the proposed method is shown in Fig. 1. First, according to the prior knowledge that the video frame with the largest expression intensity plays an important role in FER, the expression intensity of each frame is estimated by calculating the displacement of facial landmarks, and the frame with the maximum displacement is selected as the apex frame. Second, making use of the superior representation capability of deep convolutional features in a pre-trained CNN architecture, a pyramid CNN-based feature representation is generated on the apex frame to capture information on both global and local regions of the human face. Afterwards, a dynamic LBP-TOP feature is extracted to model the spatial-temporal information of the whole video sequence. Finally, the multi-feature representation formed by combining the static and dynamic features is fed into a classifier to accomplish FER. The main contributions of this paper are twofold. First, in order to capture the possible slight asymmetry between the left and right sides of the face as well as the subtle motion of local facial regions when the facial expression changes, we propose to construct a two-level image pyramid on the apex frame and extract deep convolutional features from each region of the pyramid to boost the representation capability for facial expressions. Second, the static image feature representation based on the apex frame and the dynamic feature representation based on the spatial-temporal information in the video sequence are combined to further improve the performance of facial expression recognition.

The rest of the paper is organized as follows. Section 2 gives a detailed description of the proposed pyramid CNN-based feature generation and the LBP-TOP feature. The experimental results and discussions are presented in Sect. 3. Conclusions are given in Sect. 4.

2 Methodology

The proposed FER system consists of three procedures: face preprocessing (face detection, face registration and facial expression intensity estimation), static pyramid CNN-based feature generation on the apex frame, and LBP-TOP feature extraction on the video sequence. The details of each procedure are described as follows.

In the preprocessing stage of the FER system, the face region is first detected and cropped in each frame to eliminate interference from unnecessary noise. Following the usual protocol, the Viola-Jones face detector [13] is used to detect the face region in each video frame. To further improve the accuracy of the detected face region, the method in [14] is employed to detect the facial landmarks, and the outermost facial landmarks are selected to determine the boundary of the final face region.

Once the final accurate face region of each frame in the video sequence is obtained, the face registration technique [15] is adopted to remove the influence of scale, rotation and translation changes of the face region. Consequently, the difference in facial expression between two frames is confined to the changes of the facial muscles, and the registered face image also provides an optimal input for subsequent feature extraction.
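To make the preprocessing step concrete, the following is a minimal sketch of face detection and cropping, assuming OpenCV's Haar-cascade implementation of the Viola-Jones detector; the landmark-based boundary refinement [14] and the registration step [15] are only indicated in comments, since they rely on components not covered by this sketch.

```python
import cv2

def detect_and_crop_face(frame_gray, cascade):
    """Detect the largest face with a Viola-Jones (Haar cascade) detector and crop it."""
    faces = cascade.detectMultiScale(frame_gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detection, assuming it corresponds to the subject's face.
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    return frame_gray[y:y + h, x:x + w]

# Usage sketch: OpenCV ships this cascade file with its data package.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# frame = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # one video frame
# face = detect_and_crop_face(frame, cascade)
# The crop would then be refined with the detected landmarks [14] and registered
# (e.g., by a similarity transform) before feature extraction [15].
```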

2.1 Static Pyramid CNN-based Feature

As is well known, the frame with the largest expression intensity in a video sequence contains rich discriminative expression information, and it is usually termed the apex frame. Based on this observation, we select the apex frame from the video sequence and generate a feature representation for it to improve the performance of FER. In this paper, we propose to estimate the facial expression intensity by calculating the displacement of facial landmarks and to select the frame with the maximum displacement as the apex frame. This procedure is adaptive: it applies not only to expression videos whose intensity evolves as neutral-onset-apex, as in traditional datasets, but also to videos with a neutral-onset-apex-offset-neutral intensity transformation. Inspired by the work in [14], which proposes a machine learning approach called the supervised descent method (SDM) to detect facial landmark positions with high accuracy, we adopt SDM to detect the facial landmarks in our method. Under the assumption that the facial expression evolves as neutral-onset-apex or neutral-onset-apex-offset-neutral, the first frame in the video sequence can be treated as the neutral facial expression. Therefore, letting \(X_{1}^{i}\) and \(Y_{1}^{i}\) denote the coordinates of the \({{i}^{th}}\) landmark in the first frame, and \(X_{t}^{i}\) and \(Y_{t}^{i}\) the coordinates of the \({{i}^{th}}\) landmark in the \({{t}^{th}}\) frame, the landmark displacement \(D_{t}\) between the first frame and the \({{t}^{th}}\) frame can be calculated as:

$$\begin{aligned} {{D}_{t}}=\sum \limits _{i=1}^{n}{|X_{t}^{i}-X_{1}^{i}|}+\sum \limits _{i=1}^{n}{|Y_{t}^{i}-Y_{1}^{i}|} \end{aligned}$$
(1)

where n denotes the number of landmarks detected by SDM, which is 66 as commonly used. Thereby, the frame with the maximum value of \(D_{t}\) is chosen as the apex frame. Figure 2 demonstrates the procedure of selecting the apex frame in a video sequence.

Fig. 2. The procedure of apex frame selection.
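The apex frame selection of Eq. (1) can be sketched as follows. The sketch assumes that the 66 landmarks of every frame have already been produced by an SDM-style detector [14]; the array `landmark_seq` stands in for that detector's output.

```python
import numpy as np

def select_apex_frame(landmark_seq):
    """Select the apex frame index according to Eq. (1).

    landmark_seq: array of shape (T, n, 2) with the (x, y) coordinates of the
    n facial landmarks (n = 66 here) detected in each of the T frames.
    """
    first = landmark_seq[0]                        # landmarks of the neutral first frame
    # D_t = sum_i |x_t^i - x_1^i| + sum_i |y_t^i - y_1^i|
    displacements = np.abs(landmark_seq - first).sum(axis=(1, 2))
    return int(np.argmax(displacements))           # index of the apex frame

# Usage sketch (random coordinates stand in for real landmark detections):
# landmark_seq = np.random.rand(30, 66, 2)
# apex_idx = select_apex_frame(landmark_seq)
```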

As mentioned above, as the facial expression changes, the left and right sides of the face may be asymmetric in certain frames, especially in the eye and mouth regions, so it is necessary to analyze the local regions of the face in the apex frame. We propose to construct a two-level image pyramid on the apex frame and generate deep convolutional features on the image pyramid to represent it, so that both the global and local information of the face is captured to enhance the discrimination capability of the static apex frame. The proposed pyramid CNN-based face representation is established at two scale levels. The first level corresponds to the full apex face frame, and the second level consists of 4 regions obtained by equally partitioning the full face region. Therefore, we obtain five deep features by passing each region through a pre-trained CNN architecture: \({C}_{0}\) denotes the deep feature from the first level, and \({C}_{1}\), \({C}_{2}\), \({C}_{3}\), \({C}_{4}\) denote the deep features from the second level. Afterwards, we concatenate the five deep features as C = [\({C}_{0}\), \({C}_{1}\), \({C}_{2}\), \({C}_{3}\), \({C}_{4}\)], and the final deep face representation C has a dimension of 5\(*\)512. The static pyramid CNN-based feature extraction process is shown in Fig. 3.

Fig. 3. Feature extraction procedure based on the pyramid CNN model.
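A minimal sketch of the two-level pyramid construction and feature concatenation is given below. The helper `extract_conv_feature`, which maps a face region to a 512-dimensional deep convolutional feature, is an assumption here and is sketched after Eqs. (2)-(3).

```python
import numpy as np

def pyramid_regions(face):
    """Two-level pyramid: the full apex face plus its four equal quadrants."""
    h, w = face.shape[:2]
    return [
        face,                                               # level 1: C_0
        face[:h // 2, :w // 2], face[:h // 2, w // 2:],     # level 2: C_1, C_2
        face[h // 2:, :w // 2], face[h // 2:, w // 2:],     #          C_3, C_4
    ]

def pyramid_cnn_feature(face, extract_conv_feature):
    """Concatenate the five 512-d deep features into C = [C_0, ..., C_4]."""
    feats = [extract_conv_feature(region) for region in pyramid_regions(face)]
    return np.concatenate(feats)                            # 5 * 512 = 2560 dimensions
```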

For the deep features, we propose to use the deep convolutional representation rather than the conventional outputs of the fully connected layers of the CNN. Given a pre-trained CNN model with L convolutional layers, we can extract the feature maps of an input image after resizing it to 224\(*\)224 for the VGG [18] network. The feature maps can be denoted by \(\bar{F}=\left\{ F_{i,j}:i=1,\ldots ,L;\ j=1,\ldots ,C_{i}\right\} \), where \(F_{i,j}\) is the \({{j}^{th}}\) feature map at the \({{i}^{th}}\) convolutional layer and \(C_{i}\) is the number of convolutional kernels at that layer. The size of \(F_{i,j}\) is \(W_{i}\) \(\times \) \(H_{i}\), where \(W_{i}\) and \(H_{i}\) are the width and height of each channel. Let (x, y) be a spatial coordinate of the feature map \(F_{i,j}\), and let \(f_{i,j}(x,y)\) be the response value of \(F_{i,j}\) at (x, y). Then, the image representation obtained by max-pooling can be described as follows:

$$\begin{aligned} \dot{V}_{i}=[\dot{V}_{F_{i,j}}:j=1...C_{i}] \end{aligned}$$
(2)
$$\begin{aligned} \dot{V}_{F_{i,j}}=\max \limits _{(x,y)}f_{i,j}(x,y) \end{aligned}$$
(3)
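The sketch below illustrates Eqs. (2)-(3) on the last convolutional layer of a pre-trained VGG-16, assuming the torchvision implementation; each of the 512 feature maps is max-pooled over its spatial coordinates, yielding a 512-dimensional vector per face region. The ImageNet normalization constants and the assumption of an RGB input crop are choices of this sketch, not specifications of the original experiments.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained VGG-16; `features` holds the convolutional (and pooling) layers only.
vgg = models.vgg16(pretrained=True).features.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_conv_feature(region_rgb):
    """Eqs. (2)-(3): spatial max-pooling of the 512 last-layer feature maps."""
    x = preprocess(region_rgb).unsqueeze(0)           # (1, 3, 224, 224)
    with torch.no_grad():
        fmaps = vgg(x)                                # (1, 512, 7, 7)
    return fmaps.amax(dim=(2, 3)).squeeze(0).numpy()  # 512-d vector
```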

2.2 LBP-TOP Feature

Local binary patterns from three orthogonal planes (LBP-TOP) [9] is an extension of LBP from two-dimensional space to three-dimensional space: LBP-TOP extracts local binary pattern features from the three orthogonal planes (i.e., XY, XT and YT) of a video sequence. Compared with LBP, LBP-TOP not only contains the texture information of the XY plane, but also takes into account the texture information of the XT and YT planes, which records important dynamic textures. For each plane, a histogram of dynamic texture can be defined as:

$$\begin{aligned} {{H}_{i,j}}=\sum \nolimits _{x,y,t}{I\left\{ {{f}_{j}}\left( x,y,t \right) =i \right\} } \end{aligned}$$
(4)
$$\begin{aligned} {i}=0,\cdots ,{{n}_{j}}-1;\ j=0,1,2 \end{aligned}$$

where \(n_{j}\) is the number of different labels produced by the LBP operator in the \({{j}^{th}}\) plane (j = 0: XY, 1: XT and 2: YT), \({{f}_{j}}\left( x,y,t \right) \) denotes the LBP code of the central pixel (x, y, t) in the \({{j}^{th}}\) plane, and \(I\{A\}=1\) if A is true, \(I\{A\}=0\) otherwise. Afterwards, the statistical histograms of the three planes are concatenated into one histogram. Furthermore, to account for the motion of different face regions, a block-based scheme is introduced, which cascades the histograms extracted from all block volumes. In the experiments, each sequence volume is divided into 8\(*\)8 non-overlapping blocks. The procedure of extracting block-based LBP-TOP features is shown in Fig. 4.

Fig. 4. Feature extraction procedure based on block-based LBP-TOP.
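A simplified sketch of the block-based LBP-TOP feature is given below. It assumes scikit-image's 2D LBP operator applied independently to the slices of the three orthogonal planes, which approximates rather than exactly reproduces the original LBP-TOP operator [9]; the uniform-pattern label count is likewise an assumption of this sketch.

```python
import numpy as np
from skimage.feature import local_binary_pattern

P, R = 8, 1                     # 8 neighbors, radius 1
N_LABELS = P * (P - 1) + 3      # 59 labels for non-rotation-invariant uniform LBP

def plane_histogram(slices):
    """Sum the LBP histograms of the 2D slices of one plane, following Eq. (4)."""
    hist = np.zeros(N_LABELS)
    for s in slices:
        codes = local_binary_pattern(s, P, R, method="nri_uniform").ravel()
        hist += np.bincount(codes.astype(int), minlength=N_LABELS)
    return hist

def lbp_top_block(volume):
    """Simplified LBP-TOP histogram of one block volume of shape (T, H, W)."""
    xy = plane_histogram([volume[t] for t in range(volume.shape[0])])
    xt = plane_histogram([volume[:, y, :] for y in range(volume.shape[1])])
    yt = plane_histogram([volume[:, :, x] for x in range(volume.shape[2])])
    return np.concatenate([xy, xt, yt])

def lbp_top_feature(video, blocks=(8, 8)):
    """Cascade the block-wise histograms over the 8x8 non-overlapping spatial grid."""
    n_frames, height, width = video.shape
    bh, bw = height // blocks[0], width // blocks[1]
    feats = [lbp_top_block(video[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw])
             for i in range(blocks[0]) for j in range(blocks[1])]
    return np.concatenate(feats)
```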

Furthermore, the static pyramid CNN-based feature and the LBP-TOP feature are cascaded into a final face video representation for training and testing. The strength of this final representation is that it not only contains the static feature from the apex frame, which has the maximum expression intensity among the face frames, but also takes into account the spatial-temporal information of the video sequence.
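The fusion and classification step can be sketched as follows, assuming scikit-learn's SVC, whose multiclass handling is a one-versus-one decomposition as mentioned in Sect. 4; the linear kernel, the C value and the feature standardization are illustrative choices rather than the settings of the original experiments.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def fuse(pyramid_cnn_feat, lbp_top_feat):
    """Cascade the static (5*512-d) and dynamic (block-based LBP-TOP) features."""
    return np.concatenate([pyramid_cnn_feat, lbp_top_feat])

# X_train: one fused vector per training video; y_train: expression labels (0..5).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
# clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```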

3 Experimental Results

3.1 Dataset

The extended Cohn-Kanade (CK+) dataset [16]: there are 593 frontal video sequences in total from 123 subjects. The sequences vary in duration, from 10 to 60 frames per video, starting from the neutral state and ending at the apex of the facial expression. The CK+ dataset contains 327 expression-labeled sequences covering seven expressions, but only the 309 image sequences with the six basic expressions (anger, disgust, fear, happiness, sadness, and surprise) were considered in our study.

The Oulu-CASIA dataset [17]: it consists of six expressions from 80 subjects. All the image sequences were taken under three visible light conditions: normal, weak and dark. The number of video sequences is 480 (80 subjects by six expressions) for each illumination, so there are 2880 (480\(\times \)6) video sequences in the dataset in total. All expression sequences begin at the neutral frame and end with the apex frame. In the experiments, we evaluate our method under the normal illumination condition.

3.2 Experimental Results on CK+ Dataset and Oulu-CASIA Dataset

In this part, we evaluate the proposed framework on both the CK+ and Oulu-CASIA datasets. We first test the performance of apex frame selection based on facial expression intensity estimation by calculating the facial landmark displacement. Figure 5 shows the selected apex frames from different video sequences. As the facial expression evolves from neutral to apex in both datasets, the apex frame estimated by our method is almost always the last frame of each video sequence, which proves the correctness of the apex frame selection method.

Fig. 5. Apex frames selected from (a) the CK+ dataset and (b) the Oulu-CASIA dataset.

We further evaluate the performance of the outputs of the fully connected layers and of the last convolutional layer of the CNN. Additionally, the performance of both types of deep features on a single face image as well as on the proposed two-level face image pyramid is compared. The accuracy of a specific expression is measured by the ratio of correctly recognized samples to the total number of samples of that expression, while the total accuracy is calculated as the ratio of all correctly recognized samples to the total number of testing samples. As illustrated in Fig. 6, the deep convolutional features with only 512 dimensions show competitive or even higher accuracy than the fully connected features with 4096 dimensions on both datasets. Furthermore, applying the deep features to the proposed two-level image pyramid shows that the pyramid CNN-based representation indeed improves the facial expression recognition accuracy compared with using a single face image.

Fig. 6. Comparison of recognition rates of four CNN features with different dimensions on (a) the CK+ dataset and (b) the Oulu-CASIA dataset.

Moreover, we conduct experiments to evaluate the effectiveness of combining the static pyramid CNN-based feature and the dynamic spatial-temporal LBP-TOP feature for facial expression recognition in video sequences. Tables 1, 2, 3 and 4 show the confusion matrices obtained by using only the LBP-TOP feature as well as the combination of both features with a multiclass SVM classifier. Each confusion matrix includes the recognition accuracy of each expression and the total classification accuracy. Based on the results on both datasets, the combination of the two features achieves higher total recognition accuracy and per-expression accuracy than using either the dynamic LBP-TOP feature or the static pyramid CNN-based feature alone. Especially on the CK+ dataset, the proposed framework significantly improves the performance on the anger, disgust, happiness and surprise expressions, and the recognition accuracy of the fear expression is greatly improved compared to the LBP-TOP feature.

Table 1. Confusion matrix of LBP-TOP on CK+ dataset
Table 2. Confusion matrix of LBP-TOP on Oulu-CASIA dataset
Table 3. Confusion matrix of combination feature on CK+ dataset
Table 4. Confusion matrix of combination feature on Oulu-CASIA dataset

3.3 Comparison with State-of-the-art

In the following, we compare the proposed framework with published state-of-the-art methods on each dataset. As shown in Table 5, our method achieves the highest recognition accuracy for facial expression recognition, which further proves the discriminative power and robustness of our video sequence representation combining the static pyramid CNN-based feature and the dynamic LBP-TOP feature.

Table 5. Comparison with the state-of-the-art on both datasets

4 Conclusions

In this paper, we presented a novel FER method in which static and dynamic features are integrated to boost FER performance. For the static feature extraction procedure, in order to capture the global and local information of the human face, a pyramid CNN model is constructed to extract features from apex frames, which are selected adaptively by using the displacement information of facial landmarks. Moreover, the spatial-temporal LBP-TOP feature is employed as the dynamic feature and is cascaded with the static pyramid CNN-based feature to classify expressions using a multiclass SVM with a one-versus-one strategy. The evaluation results show that our method is competitive with or even superior to the state-of-the-art methods on two facial expression datasets.