Neurocomputing, Volume 413, 6 November 2020, Pages 145-157

SAANet: Siamese action-units attention network for improving dynamic facial expression recognition

https://doi.org/10.1016/j.neucom.2020.06.062

Abstract

Facial expression recognition (FER) has a wide variety of applications, ranging from human–computer interaction and robotics to health care. Although FER has made significant progress with the success of the Convolutional Neural Network (CNN), it remains challenging, especially for video-based FER, due to the dynamic changes in facial actions. Since specific divergences exist among different expressions, we introduce a metric learning framework with a siamese cascaded structure that learns fine-grained distinctions between expressions in the video-based task. We also develop a pairwise sampling strategy for this metric learning framework. Furthermore, we propose a novel action-units attention mechanism tailored to the FER task to extract spatial contexts from the emotion regions. This mechanism works in a sparse self-attention fashion, enabling a single feature at any position to perceive the features of the action-units (AUs) parts (eyebrows, eyes, nose, and mouth). In addition, an attentive pooling module is designed to select informative items over the video sequences by capturing their temporal importance. We conduct experiments on four widely used datasets (CK+, Oulu-CASIA, MMI, and AffectNet), and also experiment on the in-the-wild AFEW dataset to further investigate the robustness of the proposed method. Results demonstrate that our approach outperforms existing state-of-the-art methods. Furthermore, we provide an ablation study of each component.

Introduction

Facial expression recognition (FER), the task of classifying the expression shown in images or video sequences, has become an increasingly active topic in computer vision in recent years. A wide range of applications demand an understanding of human emotion through facial expression, such as human–computer interaction [47], robotics [46], and health care [29], and facial expression recognition plays an important role in all of them.

With the tremendous breakthrough of Convolutional Neural Networks (CNNs), significant progress has been made on both image-based and video-based facial expression recognition [50], [57], [12]. Some works [7] try to improve recognition performance by classifying off-the-shelf features extracted directly from the images. However, most of these previous works treat FER as a naive classification problem, ignoring the observation that different expressions made by the same person share similar appearances, with only limited changes on parts of the face. This encourages us to treat FER as a fine-grained classification problem and investigate these detailed differences.

To gain discriminative features, we adapt a video-based metric learning framework for this task. Xu et al. [51] achieve state-of-the-art performance on the person re-identification task via a metric learning framework, which shows that such a framework is effective for capturing discriminative features. By considering intra-class similarity and inter-class distinction, it can learn the most discriminative features for facial expression. Since the Recurrent Neural Network (RNN) has shown great success in learning long-distance dependencies from sequential data with temporal representations [34], [56], we integrate a cascaded CNN-RNN structure as the backbone of our video-based metric learning framework. Moreover, we propose a pairwise sampling strategy tailored to FER for metric training, to ensure that our model focuses on learning the most discriminative features for facial expression.
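To make the backbone concrete, below is a minimal PyTorch sketch of one branch of the cascaded CNN-RNN structure (the siamese framework runs two such branches with shared weights). It assumes the VGG16-BN feature extractor and BiLSTM named later in this paper; the hidden size and pooling choices are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' released code) of one siamese branch:
# a VGG16-BN feature extractor applied per frame, followed by a BiLSTM
# over the sequence of frame features. Hidden size is an assumption.
import torch
import torch.nn as nn
from torchvision.models import vgg16_bn

class CNNRNNBranch(nn.Module):
    def __init__(self, hidden_size=512):
        super().__init__()
        self.cnn = vgg16_bn(weights=None).features   # (B*T, 512, H/32, W/32)
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze spatial dims
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden_size,
                           batch_first=True, bidirectional=True)

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)                     # merge batch and time
        f = self.pool(self.cnn(x)).flatten(1)        # (B*T, 512)
        h, _ = self.rnn(f.view(b, t, -1))            # (B, T, 2*hidden_size)
        return h                                     # per-step features
```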

Meanwhile, Xu et al. [51] utilize an attentive pooling module to assign weights that distinguish the relevant frame pairs. This module helps to capture the video frames that contribute most to the final classification. Similarly, for video-based FER, Zhao et al. [59] observe that facial expressions vary dynamically in strength, so the frames in one video do not contribute equally to the final classification. A peak (vigorous intensity) expression, which shows the complete expression, yields more useful features than a non-peak (weak intensity) expression of the same type from the same subject. Therefore, in our video-based metric framework, an attentive pooling module is also applied to learn this time series of emotional expression and find the peak frames in video sequences.
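As a concrete illustration, the sketch below shows attentive pooling over a single sequence: a learned scoring layer assigns each frame a normalized weight, so peak-expression frames can dominate the pooled video feature. This single-sequence form is a simplifying assumption; the module in the paper, following Xu et al. [51], computes attention jointly over the frame pairs of the two siamese branches.

```python
# Minimal sketch of attentive temporal pooling over one sequence.
# The learned scoring layer is an assumed parameterization.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)               # per-frame relevance score

    def forward(self, h):                            # h: (B, T, D) RNN outputs
        w = torch.softmax(self.score(h), dim=1)      # (B, T, 1), sums to 1 over T
        return (w * h).sum(dim=1)                    # (B, D) weighted video feature
```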

To capture more expression-guided spatial contexts, we turn to Action Units (AUs): AU-based works [42], [43] reveal that facial expression results from the motion of facial muscles around key regions (eyebrows, eyes, nose, and mouth). It is therefore essential to guide our model to pay more attention to subtle attributes around expression key-points, such as the curvature of the mouth corners, facial wrinkles, and the eyebrows, while ignoring obvious but uninformative differences such as hair color and face shape. Recently, the Non-local Network [49] proposed a self-attention approach that takes long-range contexts as spatial information and computes pixel-wise responses as a weighted sum of the features at all positions. Motivated by its attentive ability, we propose an action-units attention mechanism (AU attention) tailored to the FER task, which concentrates contextual information on the AU regions. As the features of the AU areas are much more crucial, we replace the non-local operation over the whole feature map with one that harvests contextual information only from the AU regions. Since this avoids generating a huge attention map recording the relationship between every pixel pair in the feature map, the dense computation over the whole feature map reduces to a sparse one that attends only to the S pixels of the AU regions. The time and space complexity of the non-local operation are both O((W×H)×(W×H)), where W×H denotes the spatial dimension of the input feature map; our sparse computation reduces this to O(S×(W×H)), where S < W×H.
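The sketch below illustrates this sparse variant: queries come from every position of the feature map, but keys and values are gathered only from the S positions inside an AU mask, so the attention map has shape (W×H)×S instead of (W×H)×(W×H). The AU mask is assumed to be supplied externally (e.g., derived from facial landmarks), and the projection widths are illustrative.

```python
# Minimal sketch of sparse non-local attention restricted to AU regions.
# au_mask marks the S AU pixels; every position attends only to them.
import torch
import torch.nn as nn

class SparseAUAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        inner = channels // 2                             # assumed projection width
        self.q = nn.Conv2d(channels, inner, 1)
        self.k = nn.Conv2d(channels, inner, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, x, au_mask):                        # x: (B, C, H, W); au_mask: (B, H*W) bool
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (B, H*W, C')
        k = self.k(x).flatten(2).transpose(1, 2)          # (B, H*W, C')
        v = self.v(x).flatten(2).transpose(1, 2)          # (B, H*W, C)
        out = torch.empty_like(v)
        for i in range(b):                                # S may differ per sample
            ks, vs = k[i, au_mask[i]], v[i, au_mask[i]]   # (S, C'), (S, C)
            attn = torch.softmax(q[i] @ ks.t(), dim=-1)   # (H*W, S) attention map
            out[i] = attn @ vs                            # aggregate AU features
        out = out.transpose(1, 2).reshape(b, c, h, w)
        return x + out                                    # residual, as in non-local
```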

To sum up, we propose a novel Siamese Action-units Attention Network (SAANet) that establishes a metric learning framework for FER and integrates both spatial and temporal attention modules: the action-units attention (AU attention) module and the attentive pooling module. For the backbone model in the siamese structure, we use a cascade of VGG16 [40] with Batch Normalization (BN) layers [10] and a Bidirectional Long Short-Term Memory (BiLSTM) [38] as the baseline. For spatial attention, we propose a novel AU attention module that learns a weighted feature representation of the AU regions of the face to enhance spatial feature learning. We apply this module between the CNN and the RNN: it first helps the model focus on the expressional regions, and the RNN then learns the temporal variation of the expression over those regions. For temporal attention, an attentive pooling matrix following the RNN further learns the expression intensity over time, computing attention vectors as contribution weights for each step in the video. Moreover, to suit the video-based FER task, we use a hinge loss to emphasize intra-class similarity and increase inter-class distance within the pairwise input. Our network thus learns a discriminative feature representation and a similarity measurement simultaneously, making full use of the emotion annotations.
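For the metric objective, the following is a minimal sketch of a pairwise hinge-style (contrastive) loss, assuming Euclidean distance between the two branch features and a margin hyperparameter; the exact formulation in the paper may differ.

```python
# Minimal sketch of a pairwise hinge-style metric loss. The margin
# value and squared-distance form are assumptions for illustration.
import torch
import torch.nn.functional as F

def pairwise_hinge_loss(feat_a, feat_b, same_label, margin=1.0):
    """feat_a, feat_b: (B, D) video features from the two siamese branches;
    same_label: (B,) float tensor, 1 if the pair shares a class, else 0."""
    d = F.pairwise_distance(feat_a, feat_b)             # (B,) Euclidean distances
    pos = same_label * d.pow(2)                         # pull same-class pairs together
    neg = (1 - same_label) * F.relu(margin - d).pow(2)  # push different pairs apart
    return (pos + neg).mean()
```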

The main contributions are as follows:

  • We develop a metric learning framework with a siamese CNN-RNN baseline model to investigate fine-grained distinctions among facial expressions. To suit the FER task, we propose a pairwise sampling strategy specific to this metric learning framework.

  • In the spatial domain, we develop a novel AU attention mechanism to aggregate the spatial contextual information on the crucial AU regions from long-range dependencies in a more efficient and effective way.

  • In the temporal domain, an attentive pooling module is employed to harvest the temporal correlations within the video frames.

  • We conduct experiments on four benchmark datasets (CK+, Oulu-CASIA, MMI, and AffectNet). The experimental results demonstrate that the proposed architecture outperforms state-of-the-art methods.


Facial expression recognition

Facial expression recognition (FER) has been studied for decades; the main approaches are based on local feature extraction, facial action units (AUs), temporal information, and convolutional neural networks (CNNs). Although local feature extraction methods such as Gabor filters and HOG are widely used to extract visual features, CNNs can now dig out deeper and more contextual information than these traditional approaches.

Many CNN-based methods have been proposed to solve the FER …

Siamese action-units attention network

In this section, we introduce the details of our proposed Siamese Action-units Attention Network (SAANet). We first develop a hybrid siamese neural network (siamese CNN-RNN) to obtain the basic facial feature representation for metric learning. Then, we develop a novel action-units attention (AU attention) module to obtain contextual information on faces in the spatial domain. At last, we utilize an attentive pooling module to further investigate the similarity of same expressions and the …

Emotion pairwise sampling strategy

Our pairwise training strategy aims to learn a detailed feature representation of facial expressions. In the usual metric learning manner, the network alternately takes pairs of the same category and of different categories as input. This forces the feature representations of intra-class examples to lie close together, while inter-class distances become large. In this paper, we discover that we cannot adopt the naive sampling strategy to obtain the inter-class and …
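For reference, below is a minimal sketch of the naive alternating sampler described above, which the tailored strategy refines (the snippet is truncated before those details). Function and variable names are hypothetical.

```python
# Minimal sketch of naive alternating pair sampling: same-class and
# different-class video pairs are yielded in turn. Assumes every class
# holds at least two videos.
import random

def naive_pair_sampler(videos_by_class):
    """videos_by_class: dict mapping expression label -> list of videos."""
    labels = list(videos_by_class)
    same = True
    while True:
        if same:                                   # intra-class (positive) pair
            c = random.choice(labels)
            a, b = random.sample(videos_by_class[c], 2)
            yield a, b, 1
        else:                                      # inter-class (negative) pair
            c1, c2 = random.sample(labels, 2)
            a = random.choice(videos_by_class[c1])
            b = random.choice(videos_by_class[c2])
            yield a, b, 0
        same = not same
```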

Experiments

In this section, we evaluate the performance of our proposed model on four facial expression datasets: CK+ [30], MMI [44], Oulu-CASIA [58], and AffectNet [33]. We also investigate the effectiveness of the Global-AU attention strategy and the video-based metric learning framework, and present visualization results.

Conclusion

In this paper, we have presented a novel Siamese Action-units Attention Network (SAANet), a metric learning based method for facial expression recognition. We first apply an attentive pooling module to the metric learning framework as its video-based part, where it captures expression intensities in the temporal domain. In the spatial domain, we also propose a novel action-units attention module to learn spatial contextual information from AUs-based …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Daizong Liu: Conceptualization, Methodology, Software. Xi Ouyang: Data curation, Writing - original draft. Shuangjie Xu: Visualization, Investigation. Pan Zhou: Supervision. Kun He: Writing - review & editing. Shiping Wen: Writing - review & editing.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61401169.


References (60)

  • H. Ding, S.K. Zhou, R. Chellappa, Facenet2expnet: regularizing a deep face recognition net for expression recognition, ...
  • Y. Fan, J.C. Lam, V.O. Li, Video-based emotion recognition using deeply-supervised neural networks, in: The ...
  • S. Happy et al., Automatic facial expression recognition using features of salient facial patches, IEEE Trans. Affect. Comput. (2015)
  • B. Hassani et al., Facial expression recognition using enhanced deep 3D convolutional neural networks, CoRR (2017)
  • P. Hu et al., Learning supervised scoring ensemble for emotion recognition in the wild
  • S. Ioffe et al., Batch normalization: accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning (2015)
  • S. Jain et al., Facial expression recognition with temporal modeling of shapes, IEEE International Conference on Computer Vision Workshops (2011)
  • H. Jung et al., Joint fine-tuning in deep neural networks for facial expression recognition
  • A. Kacem et al., A novel space-time representation on the positive semidefinite cone for facial expression recognition
  • Y. Kim, B. Yoo, Y. Kwak, C. Choi, J. Kim, Deep generative-contrastive networks for facial expression recognition, arXiv ...
  • S. Li et al., Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild
  • Y. Li et al., Identity-enhanced network for facial expression recognition, Asian Conference on Computer Vision (ACCV) (2018)
  • Y. Li et al., Occlusion aware facial expression recognition using CNN with attention mechanism, IEEE Trans. Image Process. (2018)
  • Z. Li et al., Learning locally-adaptive decision functions for person verification
  • S. Liao et al., Person re-identification by local maximal occurrence representation and metric learning
  • Z. Lin, M. Feng, C.N.d. Santos, M. Yu, B. Xiang, B. Zhou, Y. Bengio, A structured self-attentive sentence embedding, ...
  • C. Liu, T. Tang, K. Lv, M. Wang, Multi-feature based emotion recognition for video clips, in: Proceedings of the ...
  • M. Liu et al., Deeply learning deformable facial action parts model for dynamic expression analysis (2014)
  • M. Liu et al., Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition
  • M. Liu et al., Learning expressionlets via universal manifold model for dynamic facial expression recognition, IEEE Trans. Image Process. (2016)

Daizong Liu received the B.S. degree from the School of Electrical and Communications, Wuhan University of Technology, Wuhan, China, in 2018. He is currently working toward the M.S. degree at the School of Electronic Information and Communication, Huazhong University of Science and Technology. His research interests include computer vision and deep learning, as they apply to image and video analysis and understanding.

Xi Ouyang is pursuing his Ph.D. at the Med-X Research Institute, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China. He received his B.S. and M.S. degrees at the School of Electronic Information and Communications of Huazhong University of Science and Technology, Wuhan, China. He was a research intern at Panasonic R&D Center Singapore in 2016–2017, working on deep learning research. His current research interests include medical image analysis, deep learning, and machine learning.

Shuangjie Xu received the B.S. degree from the Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, China, in 2016, and is currently working toward the M.S. degree at the same university. His research interests include computer vision, machine learning, and deep learning, as they apply to image and video analysis and understanding.

Pan Zhou is currently an associate professor and Ph.D. advisor with the School of Cyber Science and Engineering, Huazhong University of Science and Technology (HUST), Wuhan, P.R. China. He received his Ph.D. from the School of Electrical and Computer Engineering at the Georgia Institute of Technology (Georgia Tech), Atlanta, USA, in 2011. He received his B.S. degree from the Advanced Class of HUST and his M.S. degree from the Department of Electronics and Information Engineering of HUST, Wuhan, China, in 2006 and 2008, respectively. He received an honorary degree for his bachelor's studies and a merit research award from HUST during his master's studies. He was a senior technical member at Oracle Inc., Boston, MA, USA, from 2011 to 2013, where he worked on Hadoop and distributed storage systems for big data analytics on the Oracle Cloud Platform. His current research interests include communication and information networks, security and privacy, machine learning, and big data.

Kun He is currently a professor in the Department of Computer Science, Huazhong University of Science and Technology (HUST), Wuhan, P.R. China. She was a Mary Shepard B. Upson Visiting Professor for the 2016–2017 academic year in Engineering, Cornell University, NY, USA. She received her Ph.D. degree from the Department of Automatic Control of HUST, Wuhan, P.R. China, in 2006. She received her B.S. degree from the Department of Physics, Wuhan University, Wuhan, P.R. China, in 1993, and her M.S. degree from the Department of Computer Science, Huazhong Normal University, Wuhan, P.R. China, in 2002. Her research interests include optimization, machine learning, and deep learning.

Shiping Wen received the M.Eng. degree in Control Science and Engineering from the School of Automation, Wuhan University of Technology, Wuhan, China, in 2010, and the Ph.D. degree in Control Science and Engineering from the School of Automation, Huazhong University of Science and Technology, Wuhan, China, in 2013. He is currently a Professor at the Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, Australia. His current research interests include memristor-based circuits and systems, neural networks, and deep learning.
