Neurocomputing

Volume 444, 15 July 2021, Pages 378-389

Multi-attention based Deep Neural Network with hybrid features for Dynamic Sequential Facial Expression Recognition

https://doi.org/10.1016/j.neucom.2019.11.127

Abstract

In interpersonal communication, facial expression is an important way to express one's emotions. To enable computers to understand facial expressions as human beings do, a large number of researchers have invested considerable time and effort. For now, however, most work on dynamic sequential facial expression recognition fails to make full use of the combined advantages of shallow features (prior knowledge) and deep features (high-level semantics). Therefore, this paper implements a dynamic sequential facial expression recognition system that integrates shallow and deep features with the attention mechanism. To extract the shallow features, an Attention Shallow Model (ASModel) is proposed that uses the relative positions of facial landmarks and the texture characteristics of local facial areas to describe the Action Units of the Facial Action Coding System. Exploiting the advantage of deep convolutional neural networks in expressing high-level features, an Attention Deep Model (ADModel) is also designed to extract deep features from facial image sequences. Finally, the ASModel and the ADModel are integrated into a Multi-attention Shallow and Deep Model (MSDModel) to complete dynamic sequential facial expression recognition. Three kinds of attention mechanism are introduced: Self-Attention (SA), Weight-Attention (WA), and Convolution-Attention (CA). We verify our dynamic expression recognition system on three publicly available databases, CK+, MMI, and Oulu-CASIA, and obtain performance superior to other state-of-the-art results.

Introduction

In daily communication, people convey emotional information through facial expressions to achieve mutual understanding. One can usually infer a person's psychological state by reading facial expressions, which helps build better interpersonal relationships. By learning to "read" expressions, it is easy to judge a person's emotional state, such as happy, angry, or sad, and thus to know what can be done and said at the moment; this is often a sign of "high EQ". At the same time, when bored with another person's behavior, one can also make a certain expression to signal "enough", though this depends on whether the other person can read expressions equally well.

The types of expressions are complex and varied, and different expressions can combine into a variety of compound expressions, but the basic human expressions are universal across all mankind. Six basic emotional categories are defined in [1], namely anger, disgust, fear, happiness, sadness, and surprise. The judgment of facial expressions is often a subjective process accumulated from childhood. Therefore, to enable computers to understand expressions as well as human beings, prior knowledge needs to be fully utilized when designing expression recognition algorithms, such as the Facial Action Coding System (FACS) [1]. FACS defines 64 different Action Units (AUs) that describe facial movement through facial muscle contractions, which correspond to different facial expressions. Many works [2], [3], [4], [5], [6], [7] build on FACS to achieve sequential expression recognition by tracking image features and facial movement across sequence images.

Researchers have invested a great deal of time and energy in making computers recognize facial expressions and thereby improving the intelligence of human–computer interaction. According to the data being processed, facial expression recognition can be roughly divided into two types: single static facial expression recognition and dynamic sequential facial expression recognition. The former labels a single face image, generally at the peak state of an expression, with an expression tag, while the latter labels a continuously changing sequence of facial images with a single tag. Because it must extract not only the spatial features of images but also how those spatial features change along the time dimension, dynamic sequential expression recognition is much more complicated than the static case in both data and feature dimensions. In this paper, the research object is dynamic sequential facial expression recognition: given a continuous sequence of images representing a single expression, the algorithm labels the sequence with the corresponding expression category.

Because facial muscle contractions are hard to detect directly from facial images, different AUs can instead be described by the positional relationships of facial landmarks and the local texture features of the face. For example, whether the mouth is opened or closed can be measured by the relative distance between the upper and lower lips, and whether the corners of the mouth are raised or lowered can be measured by the distance between the mouth-corner keypoints and the nose tip, assuming the nose tip stays fixed once the face is aligned. However, relationships between keypoints alone are not sufficient to describe all AUs. For example, upturned mouth corners also change the muscles around the corners of the mouth, which the corner points alone cannot capture; regional texture changes therefore also need to be detected. With its rotation invariance and gray-scale invariance in extracting texture features, the LBP [8] algorithm is chosen to describe local texture features of facial images.
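To make the shallow-feature idea concrete, the following is a minimal sketch of landmark-distance descriptors plus a uniform-LBP texture histogram. The 68-point landmark indices, the inter-ocular normalization, and the LBP settings are illustrative assumptions, not the paper's exact ASModel features:

```python
# Sketch of the shallow features: landmark-relative distances plus an LBP
# histogram of a local face region. Assumes the common 68-point annotation.
import numpy as np
from skimage.feature import local_binary_pattern

def landmark_distances(pts):
    """pts: (68, 2) array of facial landmarks for one frame."""
    nose_tip = pts[33]                          # 68-point convention
    upper_lip, lower_lip = pts[51], pts[57]
    left_corner, right_corner = pts[48], pts[54]
    iod = np.linalg.norm(pts[36] - pts[45])     # inter-ocular distance, for scale invariance
    return np.array([
        np.linalg.norm(upper_lip - lower_lip),  # mouth opened or closed
        np.linalg.norm(left_corner - nose_tip), # corner raised or lowered
        np.linalg.norm(right_corner - nose_tip),
    ]) / iod

def lbp_histogram(gray_patch, p=8, r=1):
    """Uniform-LBP histogram of one local region (e.g. a mouth crop)."""
    codes = local_binary_pattern(gray_patch, p, r, method="uniform")
    hist, _ = np.histogram(codes, bins=p + 2, range=(0, p + 2), density=True)
    return hist
```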

The rise of Convolutional Neural Networks (CNNs) has repeatedly broken the records set by traditional methods on various image tasks, such as [9], [10]. AlexNet [9] was first proposed in the ImageNet competition and won the image classification task. Since then, the development of CNNs in image tasks has taken off: R-CNN [10] applied convolutional neural networks to object detection and greatly improved detection rates. Because of their end-to-end nature in solving image classification tasks and their advantages in expressing high-level semantic features of images, CNNs have also been used in the field of facial expression recognition [11], [12], [13], [14]. However, the input of a CNN is generally 2-dimensional (2D) image data. To process sequences of facial expression images, the AlexNet structure is improved so that it can process 3-dimensional (3D) image data with a time dimension.
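As a rough illustration of such a 2D-to-3D extension, here is a minimal sketch that convolves a stack of T grayscale frames along time as well as space; the layer sizes are illustrative assumptions, not the paper's actual ADModel:

```python
# Sketch of lifting an AlexNet-style 2D stem to 3D so that a (B, 1, T, H, W)
# stack of frames is convolved spatio-temporally. Sizes are illustrative.
import torch
import torch.nn as nn

class Conv3DStem(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(3, 11, 11), stride=(1, 4, 4), padding=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),            # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):                       # x: (B, 1, T, H, W)
        return self.classifier(self.features(x).flatten(1))

# e.g. 4 sequences of 16 grayscale 224x224 frames:
# logits = Conv3DStem()(torch.randn(4, 1, 16, 224, 224))
```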

Shallow features can exploit people's prior knowledge of facial expression recognition tasks, while deep features can express high-level semantic features of images. To make full use of the advantages of both, it is necessary to combine the two kinds of features in a dynamic sequential facial expression recognition system.

There are many works based on FACS [2], [3], [4], [5], [6], on CNNs [15], and on integration methods [11], [16] for facial expression recognition of dynamic sequences. In [11], a CNN first processes sequence images, treating the sequence as a multichannel input like the RGB channels of a color image; a fully connected network then receives the facial landmark points, and a proposed joint fine-tuning method integrates the two. Two aspects of this approach can be improved. One is that the relationships among image features need to be expressed when processing sequential facial images. The other is that facial landmark points can be used to construct shallow feature representations rather than being processed directly as coordinates. In [16], a part-based hierarchical bidirectional recurrent neural network (PHRNN) is proposed for keypoint tracking and a multi-signal convolutional neural network (MSCNN) is used for identity-invariant features; the prediction probabilities of the two networks are then fused to conduct temporal and spatial facial expression recognition. Although the PHRNN is designed to track the facial landmarks, it does not explicitly relate them to prior knowledge such as FACS.

Therefore, a dynamic sequence facial expression recognition system that integrates shallow and deep features with multiple attention mechanisms is proposed in this paper. The ASModel is designed to extract shallow features based on the AUs, by describing the relative positions of facial landmarks and the texture features of local areas of the face. At the same time, taking advantage of CNNs in expressing high-level image features, the ADModel is designed to extract deep features from sequence images by improving the AlexNet structure. Finally, the MSDModel, which combines the ASModel and the ADModel, implements dynamic sequence facial expression recognition using the attention mechanism. Three kinds of attention mechanism are introduced, namely Self-Attention (SA), Weight-Attention (WA), and Convolution-Attention (CA), to strengthen the connections between image sequence features and improve the effectiveness of the model. The SA operates on the input data and is implemented by an attention matrix. The WA is used for the alignment of feature sequences and is realized by a weight matrix. The CA is introduced by means of convolution operations so that the CNN can process sequential image data.
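To make the first two mechanisms concrete, below is a minimal sketch of the SA and WA ideas as just described: an attention matrix computed over the input sequence, and a weight matrix that aligns feature sequences. The shapes and the scaled dot-product form are illustrative assumptions, not the paper's exact definitions:

```python
# Sketch of Self-Attention (SA) and Weight-Attention (WA) over a sequence
# of per-frame feature vectors. Forms and shapes are assumptions.
import torch
import torch.nn.functional as F

def self_attention(x):
    """SA: x is (T, D); a (T, T) attention matrix reweights the frames."""
    scores = x @ x.T / x.shape[-1] ** 0.5   # pairwise frame similarities
    attn = F.softmax(scores, dim=-1)        # each row sums to 1
    return attn @ x                         # frames re-expressed as mixtures

def weight_attention(x, w):
    """WA: align a (T, D) feature sequence with a learned (T, K) weight matrix."""
    return F.softmax(w, dim=0).T @ x        # (K, D) aligned summary
```

The main contributions of this paper are summarized as follows: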

1. We propose a Multi-attention Shallow and Deep Model (MSDModel) that combines shallow and deep features to complete the task of sequential facial expression recognition. Shallow features are built on FACS, while deep features are obtained through a CNN.

2. We introduce three kinds of attention mechanisms in MSDModel to strengthen the connection between image sequence features: Self-Attention (SA), Weight-Attention (WA) and Convolution-Attention (CA).

3. We verify the effectiveness of the dynamic sequence facial expression recognition system on three publicly available databases and obtain results that surpass previous state-of-the-art works.

Section snippets

Related work

Facial expression recognition tasks fall roughly into two categories according to the dimensions of the data being processed: single static facial expression recognition and dynamic sequential facial expression recognition. For the recognition of a single facial expression, the traditional method generally consists of three steps, namely face detection, feature extraction, and expression classification.

The first step is face detection. For a face image, the most valuable information on

Proposed method

In this section, the details of the ASModel, ADModel, and MSDModel are introduced. First, the extraction of the shallow features is described in Section 3.1. Next, the methods for building the ASModel and the ADModel are given in Sections 3.2 and 3.3, respectively. Finally, Section 3.4 shows how the ASModel and the ADModel are integrated into the MSDModel and derives the final cost function used for learning the whole dynamic facial expression system.
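Since only a snippet of this section is available here, the following is a minimal sketch of the integration idea it outlines: concatenating the shallow (ASModel) and deep (ADModel) feature vectors and training the fused classifier with a single classification cost. The concatenation fusion and the cross-entropy cost are assumptions, not the paper's exact Section 3.4 formulation:

```python
# Sketch of fusing shallow and deep features into one classifier head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, d_shallow, d_deep, num_classes=6):
        super().__init__()
        self.fc = nn.Linear(d_shallow + d_deep, num_classes)

    def forward(self, f_shallow, f_deep, labels=None):
        logits = self.fc(torch.cat([f_shallow, f_deep], dim=-1))
        if labels is None:
            return logits
        # A plain cross-entropy classification cost (an assumption).
        return logits, F.cross_entropy(logits, labels)
```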

Experimental results

In this section, three public databases are used to verify the dynamic sequence expression recognition system, namely CK+, MMI, and Oulu-CASIA. First, the training details of the system are introduced, such as the hyper-parameters and the optimization algorithm. Then, comparative experiments are conducted against other state-of-the-art results to analyze the advantages of the system. Last but not least, experiments are also designed to

Conclusion

In this paper, a dynamic sequence facial expression recognition system has been proposed, which integrates shallow and deep features with the attention mechanism. Shallow features are represented by the relative positions of facial landmarks and the texture features of local facial areas based on FACS. At the same time, the AlexNet structure is improved to extract the deep features of sequence images and express high-level semantic features. There are three attention mechanisms,

Future work

Of course, the idea of the system can also be extended to other image tasks, for example human motion recognition, whose shallow features can be extracted by describing geometric features based on detecting the inflection points of joints. In addition, not only AlexNet but also other convolutional neural networks can be extended into the sequence model.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (55)

  • A. Azeem et al., Hexagonal scale invariant feature transform (H-SIFT) for facial feature extraction, Journal of Applied Research and Technology (2015)
  • C. Shan et al., Facial expression recognition based on local binary patterns: a comprehensive study, Image and Vision Computing (2009)
  • P. Ekman et al., Constants across cultures in the face and emotion, Journal of Personality and Social Psychology (1971)
  • Y. Zhang et al., Active and dynamic information fusion for facial expression understanding from image sequences, IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)
  • Y. Tong et al., Facial action unit recognition by exploiting their dynamic and semantic relationships, IEEE Transactions on Pattern Analysis and Machine Intelligence (2007)
  • Y. Zhang et al., Dynamic facial expression analysis and synthesis with MPEG-4 facial animation parameters, IEEE Transactions on Circuits and Systems for Video Technology (2008)
  • R. Borgo et al., Facial expression recognition in dynamic sequences: an integrated approach, Pattern Recognition (2014)
  • A. Yao et al., Capturing AU-aware facial features and their latent relations for emotion recognition in the wild (2015)
  • Y.-L. Tian et al., Recognizing action units for facial expression analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001)
  • T. Ojala, M. Pietikäinen, D. Harwood, Performance evaluation of texture measures with classification based on Kullback...
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks
  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation
  • H. Jung et al., Joint fine-tuning in deep neural networks for facial expression recognition
  • H.-W. Ng et al., Deep learning for emotion recognition on small datasets using transfer learning
  • A. Ruiz-Garcia et al., Deep learning for emotion recognition in faces (2016)
  • W. Liu et al., Emotion recognition using multimodal deep learning, ICONIP (2016)
  • D.H. Kim et al., Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition, IEEE Transactions on Affective Computing (2017)
  • Z. Yu, G. Liu, Q. Liu, J. Deng, Spatio-temporal convolutional features with nested LSTM for facial expression...
  • P. Viola et al., Rapid object detection using a boosted cascade of simple features (2001)
  • H. Li et al., A convolutional neural network cascade for face detection
  • S. Yang, P. Luo, C.C. Loy, X. Tang, Faceness-Net: face detection through deep facial part responses, IEEE Transactions...
  • Y. Sun et al., Deep convolutional network cascade for facial point detection, 2013 IEEE Conference on Computer Vision and Pattern Recognition (2013)
  • E. Zhou, H. Fan, Z. Cao, Y. Jiang, Q. Yin, Extensive facial landmark localization with coarse-to-fine convolutional...
  • D.G. Lowe, Object recognition from local scale-invariant features
  • H. Soyel et al., Facial expression recognition based on discriminative scale invariant feature transform, Electronics Letters (2010)
  • C. Shan et al., Robust facial expression recognition using local binary patterns, IEEE International Conference on Image Processing (2005)
  • T. Ahonen et al., Face description with local binary patterns: application to face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2006)

Xiao Sun was born in 1980. He received the M.E. degree in 2004 from the Department of Computer Science and Engineering at Dalian University of Technology, and received a double doctorate from Dalian University of Technology, China (2010) and the University of Tokushima, Japan (2009). He is now a professor in the Anhui Province Key Laboratory of Affective Computing and Advanced Intelligent Machine at Hefei University of Technology. His research interests include Affective Computing, Natural Language Processing, Machine Learning and Human-Machine Interaction.

Pingping Xia was born in 1992. He received his Bachelor's degree in 2017 from Hefei University of Technology, Anhui, China. He is currently studying for a Master's degree at the School of Computer and Information, Hefei University of Technology. His research interests include facial expression recognition, deep learning and generative adversarial networks.

    Fuji Ren received his Ph.D. in 1991 from Faculty of Engineering, Hokkaido University, Sapporo, Japan. His current research interests include Natural Language Processing, Machine Translation, Multi-Lingual Multi-Function Multi-Media Intelligent Systems, Robust Methods for Dialogue Understanding, and Affective Computing and Knowledge Engineering. He is a senior member of IEEE, SMC of IEEE, a member of the Association for Natural Language Processing, Information Processing Society of Japan, The Institute of Electronics, Information and Communication Engineers, The Japanese Society for Artificial Intelligence, Japanese Society for Information and Systems in Education, Asia-Pacific Association for Machine Translation, The International Association of Science and Technology for Development, Chinese Academy of Science and Engineering in Japan. He has organized over sixty international conferences and symposia as PC, OC and Chair including IEEE NLPKE and CCIS. Professor Ren is the Editor of the International Journal COLIPS, Editor of the International Journal of Information Technology and Decision Making, Associate Editor of the International Journal of Asian Information-Science-Life, Technical Editor of the International Journal of Information Acquisition.
