Micro-expression recognition using 3D DenseNet fused Squeeze-and-Excitation Networks

https://doi.org/10.1016/j.asoc.2022.108594

Highlights

  • Appropriate preprocessing promotes the extraction of micro-expression features.

  • The three-dimensional DenseNet can extract deep facial features.

  • SE block combined with DenseNet can facilitate feature extraction.

  • Different SE block combination methods significantly affect the recognition rate.

Abstract

Micro-expressions are subtle facial movements that reveal the genuine emotional state a person tries to conceal. Most existing micro-expression recognition methods rely on hand-crafted features describing the subtle movements of facial muscles. Because of their short duration and weak intensity, accurately identifying micro-expressions remains a challenging task. This paper investigates micro-expression recognition based on deep learning and proposes a three-dimensional SE-DenseNet architecture, which fuses Squeeze-and-Excitation Networks with a 3D DenseNet and automatically integrates the spatiotemporal features extracted from each video, increasing the weight of informative feature maps. The proposed pipeline first locates the apex frame of each video, where facial muscle movement is most pronounced, and then amplifies the muscle movements with Eulerian video magnification; this pre-processing alleviates the small-sample-size and weak-intensity issues of micro-expression recognition. Finally, the pre-processed videos are fed into the 3D SE-DenseNet for further feature extraction and micro-expression classification. Experiments are performed on three public datasets. Our best model obtains an overall accuracy of 95.12%, 92.96%, and 82.74% on the SMIC, CAS(ME)2, and CASME-II datasets, respectively. The experimental results show that the proposed method captures the fine details of micro-expressions and outperforms most state-of-the-art methods on the three public datasets.

Introduction

Facial expression is a significant part of people’s daily communication and conveys abundant emotional information. Ref. [1] showed that 55% of emotional information is transmitted by facial expressions, 38% by voice, and 7% by language. Even people of different cultural backgrounds and skin colors can still communicate emotionally through facial expressions when they do not share a language. Facial expressions originate from the changes in texture and geometry caused by the movement of facial muscle tissue, and the regularity of facial muscle movement is universal to all human beings. Facial expression has therefore occupied an irreplaceable position in communication since the beginning of human existence.

Generally, expressions can be divided into macro-expressions and micro-expressions according to the range of muscle movement. A macro-expression lasts longer (i.e., 3/4 to 2 s) and involves a large range of muscle movement, so its presence is easy to detect with the naked eye. However, psychological studies [2] have shown that macro-expressions can be deceptive in various life situations, so they do not necessarily represent people’s true feelings. In contrast, the duration of a micro-expression is short (i.e., 1/25 to 1/5 s) and its motion range is relatively weak [3]. It is therefore a challenge for humans to discover and recognize micro-expressions with the naked eye. Compared with macro-expressions, micro-expressions appear on the face unconsciously and can reveal one’s true feelings. In 1966, Haggard et al. first discovered micro-expressions and argued that they are related to people’s self-protection mechanism [4]. Since then, an increasing number of researchers have focused on micro-expression recognition (MER) [5], [6], [7]. After decades of research and development, micro-expressions have been widely applied in medical treatment [8], lie detection [9], security systems [10], etc.

To enable such applications, researchers began studying the relevant theory very early. In 1997, Ekman [11] established the Facial Action Coding System (FACS) to describe the relationship between facial muscle motion and facial expression. According to the anatomical characteristics of facial muscles, FACS divides them into several independent action units (AUs) that describe the intensity and position of facial expressions. A facial expression consists of one or more AUs; for example, happiness usually consists of AU6 + AU12. Micro-expression AUs have low intensity, and in most cases only a single action unit is observed to change. In 2002, Ekman developed the Micro Expression Training Tool (METT) to train people’s ability to recognize micro-expressions; he reported that individuals who use the METT training program can improve their recognition ability by 30% to 40% within 1.5 h. In 2009, Polikovsky et al. [12] proposed a micro-expression database and used a 3D gradient histogram to extract facial motion features, after which MER based on machine learning became increasingly popular. In 2011, Pfister et al. [13] established the SMIC spontaneous micro-expression database, whose images are closer to micro-expressions in real environments, making MER research more reliable.
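
As a small illustration of how FACS composes expressions from action units, the Python sketch below maps a few prototypical emotions to AU sets and checks which emotions are consistent with a set of detected AUs. The happiness = AU6 + AU12 pairing comes from the text above; the other combinations are common FACS-based conventions included only for illustration and are not taken from this paper.

```python
# Illustrative mapping from prototypical emotions to FACS action units (AUs).
EMOTION_AUS = {
    "happiness": {6, 12},       # cheek raiser + lip corner puller (as in the text)
    "surprise": {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "disgust": {9, 15},         # nose wrinkler + lip corner depressor
}

def plausible_emotions(active_aus):
    """Return the emotions whose full AU set is contained in the detected AUs."""
    active = set(active_aus)
    return [emotion for emotion, aus in EMOTION_AUS.items() if aus <= active]

print(plausible_emotions([6, 12, 25]))  # ['happiness']
```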

To date, researchers have proposed various methods for micro-expression classification. Early MER research [1], [14] mainly relied on the spatiotemporal Local Binary Pattern (LBP), Local Binary Patterns from Three Orthogonal Planes (LBP-TOP), directional mean optical flow features, etc. However, all of these rely on time-consuming hand-crafted features. Their main drawback is that they can only extract shallow, high-dimensional features from the original video and lack the capacity to express more abstract features.

With the rapid advancement of computer hardware, deep learning has received extensive attention, and its wide application in various fields demonstrates its striking efficiency. Based on deep learning, Ref. [15] proposed an Enriched Long-term Recurrent Convolutional Network (ELRCN); the network first applies a CNN module to encode each micro-expression frame into a feature vector and then uses a long short-term memory network (LSTM) to perform prediction. The method achieves 60.98% on the CASME dataset under the leave-one-subject-out cross-validation (LOSOCV) protocol. Notably, Ref. [15] achieved a relatively high recognition rate, but it was still lower than that of quite a few traditional feature-extraction methods [16], [17], [18]. This is due to the small-sample-size limitation in MER: deep learning methods depend on large-scale datasets to extract deeper features. To increase the data size, Ref. [19] applied data augmentation to generate additional composite images from existing datasets, and the results outperformed many traditional MER methods, proving that proper preprocessing can improve recognition for deep learning methods.
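
To make the frame-encoder-plus-recurrence idea concrete, here is a minimal PyTorch sketch of a generic CNN-encoder + LSTM classifier in the spirit of ELRCN: each frame is encoded into a feature vector and an LSTM aggregates the sequence for classification. The layer sizes, channel counts, and three-class output are illustrative assumptions, not the published ELRCN configuration.

```python
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    def __init__(self, num_classes=3, feat_dim=128, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(                        # per-frame spatial encoder
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 4 * 4, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):                                # clips: (N, T, 1, H, W)
        n, t = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1)).view(n, t, -1)
        _, (h, _) = self.lstm(feats)                         # last hidden state summarizes the clip
        return self.head(h[-1])

model = CnnLstmClassifier()
print(model(torch.randn(2, 10, 1, 64, 64)).shape)            # torch.Size([2, 3])
```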

In addition to dataset limitations, overfitting, redundant parameters, and heavy computation also significantly restrain deep learning-based micro-expression recognition. To reduce unnecessary computation and improve the generalization ability of the model, we choose DenseNet as the backbone for micro-expression feature extraction. In addition, with its unique connection pattern, in which each layer receives the feature maps of all preceding layers, DenseNet can effectively alleviate the common vanishing-gradient problem. A minimal sketch of this connection pattern is given below.
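
The following PyTorch sketch illustrates that connection pattern only (in 2D, with a handful of layers): every layer’s output is concatenated with all earlier feature maps, so later layers and their gradients have short paths back to the input. The growth rate and layer count are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class DenseBlock2D(nn.Module):
    """Toy dense block: each layer receives the concatenation of all previous feature maps."""
    def __init__(self, in_channels, num_layers=4, growth_rate=12):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate                  # concatenation grows the channel count

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # each layer sees all previous maps
            features.append(out)
        return torch.cat(features, dim=1)

# A 16-channel input grows to 16 + 4 * 12 = 64 channels.
block = DenseBlock2D(16)
print(block(torch.randn(1, 16, 32, 32)).shape)       # torch.Size([1, 64, 32, 32])
```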

This paper proposes a new and robust feature-learning model with effective preprocessing, which can efficiently represent subtle facial muscle movements in the MER process. The main contributions of this paper are summarized as follows:

  • 1. We extend the collected apex frames (i.e., data augmentation) from three public datasets (i.e., SMIC, CAS(ME)2, and CASME-II) to alleviate the small-sample-size limitation, and then exploit Eulerian video magnification to better represent the details of facial muscle movement (a simplified magnification sketch is given after this list).

  • 2. We extend the convolution kernels and pooling layers of the DenseNet model to three dimensions (i.e., from 2D-DenseNet to 3D-DenseNet), which better captures the spatial and temporal information in video sequences and thus extracts deeper facial muscle features.

  • 3. We integrate an attention mechanism by squeezing and exciting the channels of the 3D DenseNet, so that the resulting network (i.e., 3D SE-DenseNet) can adaptively assign feature weights, strengthening the extraction of effective features and suppressing useless ones. Moreover, we experimentally compare variants of 3D SE-DenseNet to observe the impact of SENet at different positions within DenseNet (a minimal sketch of an SE-reweighted 3D dense block follows this list).
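
For contribution 1, the snippet below is a crude, full-frame approximation of Eulerian-style motion magnification: each frame is spatially smoothed, every pixel is temporally band-pass filtered, and the filtered signal is amplified and added back. The cut-off frequencies, amplification factor, and the absence of a Laplacian pyramid are simplifying assumptions, so this is only an illustrative sketch rather than the magnification pipeline used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.signal import butter, filtfilt

def magnify_motion(frames, fps, low=0.4, high=3.0, alpha=10.0, sigma=5.0):
    """Crude Eulerian-style magnification on a grayscale clip.

    frames: float array of shape (T, H, W) with values in [0, 1];
            T should be at least a few dozen frames for stable filtering.
    """
    # Spatial low-pass: keep only coarse structure, which is what gets amplified.
    coarse = np.stack([gaussian_filter(f, sigma) for f in frames])
    # Temporal band-pass: isolate subtle motions in the chosen frequency band.
    b, a = butter(2, [low / (fps / 2), high / (fps / 2)], btype="band")
    band = filtfilt(b, a, coarse, axis=0)
    # Amplify the filtered signal and add it back to the original frames.
    return np.clip(frames + alpha * band, 0.0, 1.0)

# Example on random data; real input would be the augmented apex-frame clip.
magnified = magnify_motion(np.random.rand(40, 64, 64), fps=100.0)
```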
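
For contributions 2 and 3, the following PyTorch sketch shows a 3D dense layer (the 2D convolution replaced by its 3D counterpart, operating on (N, C, T, H, W) tensors) and a 3D Squeeze-and-Excitation block that reweights channels. The growth rate, reduction ratio, layer count, and the placement of the SE block after the dense layers are illustrative assumptions, not the paper’s final 3D SE-DenseNet configuration.

```python
import torch
import torch.nn as nn

class DenseLayer3D(nn.Module):
    """BN-ReLU-Conv3d layer whose output is concatenated with its input,
    mirroring a 2D DenseNet layer but over spatiotemporal volumes."""
    def __init__(self, in_channels, growth_rate=12):
        super().__init__()
        self.bn = nn.BatchNorm3d(in_channels)
        self.conv = nn.Conv3d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):                                    # x: (N, C, T, H, W)
        return torch.cat([x, self.conv(torch.relu(self.bn(x)))], dim=1)

class SEBlock3D(nn.Module):
    """Squeeze: global average over (T, H, W). Excite: two FC layers produce
    per-channel weights in (0, 1) that rescale the feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool3d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        n, c = x.shape[:2]
        w = self.excite(self.squeeze(x).view(n, c)).view(n, c, 1, 1, 1)
        return x * w                                         # channel-wise reweighting

# One possible placement: SE reweighting after a stack of 3D dense layers.
layers, channels = [], 16
for _ in range(4):
    layers.append(DenseLayer3D(channels))
    channels += 12
se_dense_block = nn.Sequential(*layers, SEBlock3D(channels))
print(se_dense_block(torch.randn(2, 16, 8, 32, 32)).shape)   # torch.Size([2, 64, 8, 32, 32])
```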

The rest of this paper is organized as follows: Section 2 briefly reviews related work. Section 3 presents the main methods of this paper. Section 4 describes the datasets and experimental settings. Section 5 compares and analyzes our results against other representative methods and reports various contrast experiments. Finally, Section 6 gives a brief conclusion.

Section snippets

Related work

Most methods in the literature combine image preprocessing and feature extraction. In the following subsections, several representative preprocessing techniques and deep learning-based feature extractors used in MER are discussed and explained.

Proposed method

Although CNN, LSTM and their improved variants have obtained impressive performance for micro-expression recognition [25], [26], [27], [33], the vanishing-gradient problem still occurs to some extent in tasks with long sequences (e.g., CASME-II). Therefore, this paper aims to address the vanishing-gradient problem for micro-expression recognition in long video sequences and to alleviate the small-sample-size limitation.

The framework proposed in this paper is demonstrated in Fig. 1, which

Dataset

Experiments are performed on three standard micro-expression datasets including the Spontaneous micro-expression corpus (SMIC) [33], Chinese Academy of Sciences Macro and Micro-expressions (CAS(ME)2) [2] and Chinese Academy of Sciences Micro-expression-II (CASME-II) [34]. More details regarding these datasets will be described below.

Recognition performance

The cross-validation method (i.e., the hold-out protocol) is widely used to evaluate prediction performance, especially the performance of a trained model on new data, and can reduce over-fitting to a certain extent. As shown in Table 3, on three public datasets (i.e., SMIC, CASME-II, and CAS(ME)2), this paper compares our two deep models (i.e., 3D-DenseNet and SE-DenseNet) with representative state-of-the-art deep learning-based methods [21], [29], [30], [32], [33]. As shown in

Conclusion

In this paper, we proposed a novel micro-expression recognition approach based on a densely connected convolutional network fused with SENet (3D SE-DenseNet). The proposed model extends DenseNet to three dimensions (i.e., 3D-DenseNet) and uses SE blocks to adaptively assign weights to feature channels, enhancing the learning ability and modeling the spatiotemporal deformation of micro-expression sequences. Firstly, the augmented and EVM-amplified video sequences are computed from the apex frames.

CRediT authorship contribution statement

Linqin Cai: Conceptualization, Methodology, Software, Investigation, Writing – review & editing. Hao Li: Experiment, Data processing, Writing – original draft, Visualization, Data curation. Wei Dong: Experiment, Software, Validation. Haodu Fang: Data preprocessing, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (34)

  • Wang, S.J., et al., Face recognition and micro-expression recognition based on discriminant tensor subspace analysis plus extreme learning machine, Neural Process. Lett. (2014).

  • Huang, X.H., et al., Spontaneous facial micro-expression analysis using spatiotemporal completed local quantized patterns, Neurocomputing (2015).

  • Yaacoub, A., et al., Diagnosing clinical manifestation of apathy using machine learning and micro-facial expressions detection.

  • Iwasaki, M., et al., Hiding true emotions: micro-expressions in eyes retrospectively concealed by mouth movements, Sci. Rep. (2016).

  • Sun, Y., et al., Facial age and expression synthesis using ordinal ranking adversarial networks, IEEE Trans. Inf. Forensics Secur. (2020).

  • Polikovsky, S., et al., Facial micro-expressions recognition using high speed camera and 3D-gradient descriptor.