Neurocomputing

Volume 444, 15 July 2021, Pages 378-389

Multi-attention based Deep Neural Network with hybrid features for Dynamic Sequential Facial Expression Recognition

https://doi.org/10.1016/j.neucom.2019.11.127

Abstract

In interpersonal communication, facial expression is an important way to express one's emotions. To enable computers to understand facial expressions as human beings do, a large number of researchers have invested considerable time and effort. For now, however, most work on dynamic sequential facial expression recognition fails to make full use of the combined advantages of shallow features (prior knowledge) and deep features (high-level semantics). Therefore, this paper implements a dynamic sequential facial expression recognition system that integrates shallow and deep features with the attention mechanism. To extract the shallow features, an Attention Shallow Model (ASModel) is proposed that uses the relative positions of facial landmarks and the texture characteristics of local facial areas to describe the Action Units of the Facial Action Coding System. Exploiting the advantage of deep convolutional neural networks in expressing high-level features, an Attention Deep Model (ADModel) is also designed to extract deep features from facial image sequences. Finally, the ASModel and the ADModel are integrated into a Multi-attention Shallow and Deep Model (MSDModel) to complete dynamic sequential facial expression recognition. Three kinds of attention mechanism are introduced: Self-Attention (SA), Weight-Attention (WA), and Convolution-Attention (CA). We verify our dynamic expression recognition system on three publicly available databases, CK+, MMI, and Oulu-CASIA, and obtain performance superior to other state-of-the-art results.

Introduction

In daily communication, people convey emotional information through facial expressions to achieve mutual understanding. One can usually infer a person's psychological state by reading facial expressions, which helps build better interpersonal relationships. By learning to "read" expressions, it is easy to judge a person's emotional state, such as happy, angry, or sad, and thus to know what can be done and said at the moment; this is often a sign of "high EQ". At the same time, when bored with another person's behavior, one can also make a certain expression to signal "enough", though this depends on whether the other person can read expressions equally well.

The types of expressions are complex and varied, and different expressions can combine into a variety of compound expressions, but the basic human expressions are universal across all mankind. Six basic emotional categories are defined in [1], namely anger, disgust, fear, happiness, sadness, and surprise. The judgment of facial expressions is often a subjective process accumulated from childhood. Therefore, to enable computers to understand expressions as well as human beings, prior knowledge needs to be fully utilized when designing expression recognition algorithms, such as the Facial Action Coding System (FACS) [1]. FACS defines 64 different Action Units (AUs) that describe facial movement through facial muscle contractions, which correspond to different facial expressions. Many works [2], [3], [4], [5], [6], [7] build on FACS to achieve sequential expression recognition by tracking image features and facial movement across sequence images.

Researchers have invested a great deal of time and energy in making computers recognize facial expressions and thereby improving the intelligence of human–computer interaction. According to the data being processed, facial expression recognition can be roughly divided into two types: single static facial expression recognition and dynamic sequential facial expression recognition. The former labels a single face image, generally at the peak state of an expression, with an expression tag, while the latter labels a continuously changing sequence of facial images with a single tag. Because it must extract not only the spatial features of images but also how those spatial features change along the time dimension, dynamic sequential expression recognition is much more complicated than the static case in both data and feature dimensions. In this paper, the research object is dynamic sequential facial expression recognition: given a continuous sequence of images representing a single expression, the algorithm labels the sequence with the corresponding expression category.

Because facial muscle contractions are hard to detect directly from facial images, different AUs can instead be described by the positional relationships of facial landmarks and the local texture features of the face. For example, whether the mouth is opened or closed can be measured by the relative distance between the upper and lower lips, and whether the corners of the mouth are raised or lowered can be measured by the distance between the mouth-corner keypoints and the nose tip, assuming the nose tip stays fixed once the face is aligned. However, relationships between keypoints alone are not sufficient to describe all AUs. For example, upturned mouth corners also change the muscles around the corners of the mouth, which the corner points alone cannot capture; regional texture changes therefore also need to be detected. With its rotation invariance and gray-scale invariance in extracting texture features, the LBP [8] algorithm is chosen to describe local texture features of facial images.
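To make the shallow-feature idea concrete, the following is a minimal sketch of landmark-distance descriptors plus a uniform-LBP texture histogram. The 68-point landmark indices, the inter-ocular normalization, and the LBP settings are illustrative assumptions, not the paper's exact ASModel features:

```python
# Sketch of the shallow features: landmark-relative distances plus an LBP
# histogram of a local face region. Assumes the common 68-point annotation.
import numpy as np
from skimage.feature import local_binary_pattern

def landmark_distances(pts):
    """pts: (68, 2) array of facial landmarks for one frame."""
    nose_tip = pts[33]                          # 68-point convention
    upper_lip, lower_lip = pts[51], pts[57]
    left_corner, right_corner = pts[48], pts[54]
    iod = np.linalg.norm(pts[36] - pts[45])     # inter-ocular distance, for scale invariance
    return np.array([
        np.linalg.norm(upper_lip - lower_lip),  # mouth opened or closed
        np.linalg.norm(left_corner - nose_tip), # corner raised or lowered
        np.linalg.norm(right_corner - nose_tip),
    ]) / iod

def lbp_histogram(gray_patch, p=8, r=1):
    """Uniform-LBP histogram of one local region (e.g. a mouth crop)."""
    codes = local_binary_pattern(gray_patch, p, r, method="uniform")
    hist, _ = np.histogram(codes, bins=p + 2, range=(0, p + 2), density=True)
    return hist
```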

The rise of Convolutional Neural Networks (CNNs) has repeatedly broken the records set by traditional methods on various image tasks, such as [9], [10]. AlexNet [9] was first proposed in the ImageNet competition and won the image classification task. Since then, the development of CNNs in image tasks has taken off: R-CNN [10] applied convolutional neural networks to object detection and greatly improved detection rates. Because of their end-to-end nature in solving image classification tasks and their advantages in expressing high-level semantic features of images, CNNs have also been used in the field of facial expression recognition [11], [12], [13], [14]. However, the input of a CNN is generally 2-dimensional (2D) image data. To process sequences of facial expression images, the AlexNet structure is improved so that it can process 3-dimensional (3D) image data with a time dimension.
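As a rough illustration of such a 2D-to-3D extension, here is a minimal sketch that convolves a stack of T grayscale frames along time as well as space; the layer sizes are illustrative assumptions, not the paper's actual ADModel:

```python
# Sketch of lifting an AlexNet-style 2D stem to 3D so that a (B, 1, T, H, W)
# stack of frames is convolved spatio-temporally. Sizes are illustrative.
import torch
import torch.nn as nn

class Conv3DStem(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(3, 11, 11), stride=(1, 4, 4), padding=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),            # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                       # x: (B, 1, T, H, W)
        return self.classifier(self.features(x).flatten(1))

# e.g. 4 sequences of 16 grayscale 224x224 frames:
# logits = Conv3DStem()(torch.randn(4, 1, 16, 224, 224))
```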

Shallow features can exploit people's prior knowledge of facial expression recognition tasks, while deep features can express high-level semantic features of images. To make full use of the advantages of both, it is necessary to combine the two kinds of features in a dynamic sequential facial expression recognition system.

There are many works based on FACS [2], [3], [4], [5], [6], on CNNs [15], and on integration methods [11], [16] for facial expression recognition of dynamic sequences. In [11], a CNN first processes sequence images, treating the sequence as a multichannel input like the RGB channels of a color image; a fully connected network then receives the facial landmark points, and a proposed joint fine-tuning method integrates the two. Two aspects of this approach can be improved. One is that the relationships among image features need to be expressed when processing sequential facial images. The other is that facial landmark points can be used to construct shallow feature representations rather than being processed directly as coordinates. In [16], a part-based hierarchical bidirectional recurrent neural network (PHRNN) is proposed for keypoint tracking and a multi-signal convolutional neural network (MSCNN) is used for identity-invariant features; the prediction probabilities of the two networks are then fused to conduct temporal and spatial facial expression recognition. Although the PHRNN is designed to track the facial landmarks, it does not explicitly relate them to prior knowledge such as FACS.

Therefore, a dynamic sequence facial expression recognition system that integrates shallow and deep features with multiple attention mechanisms is proposed in this paper. The ASModel is designed to extract shallow features based on the AUs, by describing the relative positions of facial landmarks and the texture features of local areas of the face. At the same time, taking advantage of CNNs in expressing high-level image features, the ADModel is designed to extract deep features from sequence images by improving the AlexNet structure. Finally, the MSDModel, which combines the ASModel and the ADModel, implements dynamic sequence facial expression recognition using the attention mechanism. Three kinds of attention mechanism are introduced, namely Self-Attention (SA), Weight-Attention (WA), and Convolution-Attention (CA), to strengthen the connections between image sequence features and improve the effectiveness of the model. The SA operates on the input data and is implemented by an attention matrix. The WA is used for the alignment of feature sequences and is realized by a weight matrix. The CA is introduced by means of convolution operations so that the CNN can process sequential image data.
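To make the first two mechanisms concrete, below is a minimal sketch of the SA and WA ideas as just described: an attention matrix computed over the input sequence, and a weight matrix that aligns feature sequences. The shapes and the scaled dot-product form are illustrative assumptions, not the paper's exact definitions:

```python
# Sketch of Self-Attention (SA) and Weight-Attention (WA) over a sequence
# of per-frame feature vectors. Forms and shapes are assumptions.
import torch
import torch.nn.functional as F

def self_attention(x):
    """SA: x is (T, D); a (T, T) attention matrix reweights the frames."""
    scores = x @ x.T / x.shape[-1] ** 0.5   # pairwise frame similarities
    attn = F.softmax(scores, dim=-1)        # each row sums to 1
    return attn @ x                         # frames re-expressed as mixtures

def weight_attention(x, w):
    """WA: align a (T, D) feature sequence with a learned (T, K) weight matrix."""
    return F.softmax(w, dim=0).T @ x        # (K, D) aligned summary
```

The main contributions of this paper are summarized as follows: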

1. We propose a Multi-attention Shallow and Deep Model (MSDModel) that combines shallow and deep features to complete the task of sequential facial expression recognition. Shallow features are built on FACS, while deep features are obtained through a CNN.

2. We introduce three kinds of attention mechanisms in MSDModel to strengthen the connection between image sequence features: Self-Attention (SA), Weight-Attention (WA) and Convolution-Attention (CA).

3. We verify the effectiveness of the dynamic sequence facial expression recognition system on three publicly available databases and obtain results that surpass previous state-of-the-art works.

Section snippets

Related work

Facial expression recognition tasks fall roughly into two categories according to the dimensions of the data being processed: single static facial expression recognition and dynamic sequential facial expression recognition. For the recognition of a single facial expression, the traditional method generally consists of three steps, namely face detection, feature extraction, and expression classification.

The first step is face detection. For a face image, the most valuable information on

Proposed method

In this section, the details of the ASModel, ADModel, and MSDModel are introduced. First, the extraction of the shallow features is described in Section 3.1. Next, the methods for building the ASModel and the ADModel are given in Sections 3.2 and 3.3, respectively. Finally, Section 3.4 shows how the ASModel and the ADModel are integrated into the MSDModel and derives the final cost function used for learning the whole dynamic facial expression system.
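Since only a snippet of this section is available here, the following is a minimal sketch of the integration idea it outlines: concatenating the shallow (ASModel) and deep (ADModel) feature vectors and training the fused classifier with a single classification cost. The concatenation fusion and the cross-entropy cost are assumptions, not the paper's exact Section 3.4 formulation:

```python
# Sketch of fusing shallow and deep features into one classifier head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, d_shallow, d_deep, num_classes=6):
        super().__init__()
        self.fc = nn.Linear(d_shallow + d_deep, num_classes)

    def forward(self, f_shallow, f_deep, labels=None):
        logits = self.fc(torch.cat([f_shallow, f_deep], dim=-1))
        if labels is None:
            return logits
        # A plain cross-entropy classification cost (an assumption).
        return logits, F.cross_entropy(logits, labels)
```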

Experimental results

In this section, three public databases are used to verify the dynamic sequence expression recognition system, namely CK+, MMI, and Oulu-CASIA. First, the training details of the system are introduced, such as the hyper-parameters and the optimization algorithm. Then, comparative experiments are conducted against other state-of-the-art results to analyze the advantages of the system. Last but not least, experiments are also designed to

Conclusion

In this paper, a dynamic sequence facial expression recognition system has been proposed, which integrates shallow and deep features with the attention mechanism. Shallow features are represented by the relative positions of facial landmarks and the texture features of local facial areas based on FACS. At the same time, the AlexNet structure is improved to extract the deep features of sequence images and express high-level semantic features. There are three attention mechanisms,

Future work

Of course, the idea of the system can also be extended to other image tasks, for example human motion recognition, whose shallow features can be extracted by describing geometric features based on detecting the inflection points of joints. In addition, not only AlexNet but also other convolutional neural networks can be extended into the sequence model.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (55)

  • A. Azeem et al., Hexagonal scale invariant feature transform (H-SIFT) for facial feature extraction, Journal of Applied Research and Technology (2015)
  • C. Shan et al., Facial expression recognition based on local binary patterns: a comprehensive study, Image and Vision Computing (2009)
  • P. Ekman et al., Constants across cultures in the face and emotion, Journal of Personality and Social Psychology (1971)
  • Y. Zhang et al., Active and dynamic information fusion for facial expression understanding from image sequences, IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)
  • Y. Tong et al., Facial action unit recognition by exploiting their dynamic and semantic relationships, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
  • Y. Zhang et al., Dynamic facial expression analysis and synthesis with MPEG-4 facial animation parameters, IEEE Transactions on Circuits and Systems for Video Technology (2008)
  • R. Borgo et al., Facial expression recognition in dynamic sequences: an integrated approach, Pattern Recognition (2014)
  • A. Yao et al., Capturing AU-aware facial features and their latent relations for emotion recognition in the wild (2015)
  • Y.-L. Tian et al., Recognizing action units for facial expression analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001)
  • T. Ojala, M. Pietikäinen, D. Harwood, Performance evaluation of texture measures with classification based on Kullback...
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks
  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation
  • H. Jung et al., Joint fine-tuning in deep neural networks for facial expression recognition
  • H.-W. Ng et al., Deep learning for emotion recognition on small datasets using transfer learning
  • A. Ruiz-Garcia et al., Deep learning for emotion recognition in faces (2016)
  • W. Liu et al., Emotion recognition using multimodal deep learning, ICONIP (2016)
  • D.H. Kim et al., Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition, IEEE Transactions on Affective Computing (2017)
  • Z. Yu, G. Liu, Q. Liu, J. Deng, Spatio-temporal convolutional features with nested LSTM for facial expression...
  • P. Viola et al., Rapid object detection using a boosted cascade of simple features (2001)
  • H. Li et al., A convolutional neural network cascade for face detection
  • S. Yang, P. Luo, C.C. Loy, X. Tang, Faceness-Net: face detection through deep facial part responses, IEEE Transactions...
  • Y. Sun et al., Deep convolutional network cascade for facial point detection, 2013 IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • E. Zhou, H. Fan, Z. Cao, Y. Jiang, Q. Yin, Extensive facial landmark localization with coarse-to-fine convolutional...
  • D.G. Lowe, Object recognition from local scale-invariant features
  • H. Soyel et al., Facial expression recognition based on discriminative scale invariant feature transform, Electronics Letters (2010)
  • C. Shan et al., Robust facial expression recognition using local binary patterns, IEEE International Conference on Image Processing (2005)
  • T. Ahonen et al., Face description with local binary patterns: application to face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2006)

Xiao Sun was born in 1980. He received the M.E. degree in 2004 from the Department of Computer Science and Engineering at Dalian University of Technology, and received a double doctorate from Dalian University of Technology, China (2010) and the University of Tokushima, Japan (2009). He is now a professor in the Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine at Hefei University of Technology. His research interests include Affective Computing, Natural Language Processing, Machine Learning and Human-Machine Interaction.

Pingping Xia was born in 1992. He received his Bachelor's degree in 2017 from Hefei University of Technology, Anhui, China. He is currently studying for a Master's degree at the School of Computer and Information, Hefei University of Technology. His research interests include facial expression recognition, deep learning and generative adversarial networks.

    Fuji Ren received his Ph.D. in 1991 from Faculty of Engineering, Hokkaido University, Sapporo, Japan. His current research interests include Natural Language Processing, Machine Translation, Multi-Lingual Multi-Function Multi-Media Intelligent Systems, Robust Methods for Dialogue Understanding, and Affective Computing and Knowledge Engineering. He is a senior member of IEEE, SMC of IEEE, a member of the Association for Natural Language Processing, Information Processing Society of Japan, The Institute of Electronics, Information and Communication Engineers, The Japanese Society for Artificial Intelligence, Japanese Society for Information and Systems in Education, Asia-Pacific Association for Machine Translation, The International Association of Science and Technology for Development, Chinese Academy of Science and Engineering in Japan. He has organized over sixty international conferences and symposia as PC, OC and Chair including IEEE NLPKE and CCIS. Professor Ren is the Editor of the International Journal COLIPS, Editor of the International Journal of Information Technology and Decision Making, Associate Editor of the International Journal of Asian Information-Science-Life, Technical Editor of the International Journal of Information Acquisition.
