Elsevier

Pattern Recognition

Volume 112, April 2021, 107694

OAENet: Oriented attention ensemble for accurate facial expression recognition

https://doi.org/10.1016/j.patcog.2020.107694

Highlights

  • We propose an Oriented Attention Enable Network (OAENet) architecture for FER, which aggregates ROI awareness and an attention mechanism, ensuring sufficient utilization of both global and local features.

  • We propose a weighted mask that combines the facial landmarks and correlation coefficients, which proves effective in improving the attention on local regions.

  • Our method achieves state-of-the-art performance on several leading datasets such as CK+, RAF-DB, and AffectNet.

Abstract

Facial Expression Recognition (FER) is a challenging yet important research topic owing to its academic and commercial potential. In this work, we propose an oriented attention pseudo-siamese network that takes advantage of global and local facial information for highly accurate FER. Our network consists of two branches: a maintenance branch, composed of several convolutional blocks to exploit high-level semantic features, and an attention branch, with a UNet-like architecture, to obtain local highlight information. Specifically, we first feed the face image into the maintenance branch. For the attention branch, we calculate the correlation coefficient between a face and its sub-regions. Next, we construct a weighted mask by correlating the facial landmarks with the correlation coefficients. Then, the weighted mask is sent to the attention branch. Finally, the two branches are fused to output the classification results. As such, a direction-dependent attention mechanism is established to remedy the insufficient utilization of local information. With the help of our attention mechanism, our network not only captures the global picture but can also concentrate on important local areas. Experiments are carried out on four leading facial expression datasets, where our method achieves very appealing performance compared to other state-of-the-art methods.
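The two-branch fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the shapes, the random stand-in feature maps, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two branch outputs; real activations would come
# from the maintenance (global) and attention (local) CNN branches.
global_feats = rng.random((6, 6, 16))  # maintenance branch feature map
local_feats = rng.random((6, 6, 16))   # attention branch feature map

# A depthwise (channel-wise, elementwise) product preserves spatial
# alignment, unlike flattening and concatenating the two feature maps.
fused = global_feats * local_feats

print(fused.shape)  # → (6, 6, 16)
```

The key design point this illustrates is that each fused value stays tied to its spatial position, so local location information survives the fusion step.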

Introduction

Understanding the unspoken words from facial and body cues is a fundamental human trait, and such aptitude is vital in our daily communications and social interactions. Among the many inputs that can be used to derive emotions, facial expression is by far the most popular. One of the pioneering works by Paul Ekman [10] identified six emotions (surprise, happiness, sadness, anger, fear, and disgust) that are universal across different cultures. The invention of the Facial Action Coding System (FACS) made facial expression analysis convenient.

Researchers have worked for many years to promote the development of Facial Expression Recognition (FER). Several methods have been proposed to tackle the inconsistent labeling problem [37], among which model fine-tuning and multi-dataset cross training are two popular strategies. However, they have drawbacks. FER differs from general classification tasks, such as recognizing cats, dogs, and people, in that the differences between various expressions are much smaller. These methods may ignore such subtle differences, and the distinguishing features can vanish as the network goes deeper, resulting in feature degradation.

To solve this problem, we rethink the impact of the attention mechanism on fine-grained categorization. Recognizing fine-grained categories (e.g., bird species) relies heavily on discriminative part localization and part-based fine-grained feature learning. Moreover, part localization and fine-grained feature learning are mutually correlated and can thus reinforce each other [11]. In this way, local location information and image feature information are fused and mutually promoted, greatly improving recognition accuracy. Fu et al. [6] proposed a novel recurrent attention convolutional neural network (RA-CNN) that recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way. Wang et al. [32] proposed the residual attention network, a convolutional neural network that incorporates an attention mechanism into a general CNN architecture to improve classification accuracy. In FER, the variations between different expressions of the same person are small, which leads to small inter-class distances and makes it difficult for conventional metrics to distinguish between these categories. To solve this problem, Cai et al. [3] proposed a novel loss function called island loss. By increasing the spacing between different expression categories, a network trained with island loss learns more distinguishing features.
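As a concrete illustration, the pairwise penalty of the island loss can be sketched as below. This is a hedged NumPy reconstruction: variable names are ours, and the full island loss of Cai et al. also includes a center-loss term not shown here.

```python
import numpy as np

def island_pairwise_term(centers):
    """Sum of (cosine similarity + 1) over all ordered pairs of distinct
    class centers. Minimizing it pushes expression-class centers apart."""
    total = 0.0
    for j, cj in enumerate(centers):
        for k, ck in enumerate(centers):
            if j == k:
                continue
            cos = cj @ ck / (np.linalg.norm(cj) * np.linalg.norm(ck))
            total += cos + 1.0
    return total

# Anti-parallel centers are maximally separated: each ordered pair
# contributes (-1 + 1) = 0, so the penalty vanishes.
apart = np.array([[1.0, 0.0], [-1.0, 0.0]])
close = np.array([[1.0, 0.0], [1.0, 0.1]])
print(island_pairwise_term(apart))                              # → 0.0
print(island_pairwise_term(apart) < island_pairwise_term(close))  # → True
```

Because the term grows as centers point in similar directions, gradient descent on it increases the angular spacing between expression classes, which is exactly the effect described above.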

Conventional expression recognition algorithms tend to identify global features of the entire face. Recently, some algorithms [5] began to take regional patches of different portions of a face (e.g., eye, nose, and mouth regions) as network input, aiming to capture global and local features simultaneously. However, these methods explicitly discard the features of the remaining facial areas, leading to insufficient exploitation. When judging expressions, people pay more attention to vital areas and less, but not zero, attention to the others; it is therefore problematic to completely abandon local information outside the key areas. Moreover, these multi-region algorithms simply concatenate the features obtained by convolving individual sub-regions at the final convolutional layer, losing the spatial location information of the facial expression image.

Li et al. li2018occlusion proposed a CNN architecture with an attention mechanism to solve the occlusion problem in FER. They assigned attention weights to the focused landmark areas, while the other parts of the face were ignored. Sun et al. sun2018visual also applied an attention mechanism to improve FER accuracy; however, their landmark information is used to align the posed face rather than to extract local attention information. Liu et al. [18] proposed a novel FER framework, named the identity-disentangled facial expression recognition machine (IDFERM), which treats an expression as a combination of an expressive component and a neutral component of a person. The generated normalized faces are used as hard negative samples for metric learning to improve FER performance. However, the approach depends on the quality of the generated normalized faces.

In this paper, we are interested in recognizing facial expressions using a pseudo-siamese network with an oriented attention ensemble, which consists of two branches, named the maintenance branch and the attention branch. To enhance the keypoint weights of the input face image, we design a novel weighted mask constructed by correlating the facial landmarks with the correlation coefficients. To design the weighted mask, we carry out extensive region-test experiments on the test dataset for each expression, estimating how much different sub-regions contribute to each expression. After rethinking the attention mechanism in fine-grained object recognition and image classification, we introduce a direction-dependent attention mechanism into our network to utilize local highlight features more efficiently through the attention branch. Furthermore, to preserve local location information, we perform a depthwise product when combining the output feature maps of the two branches. Our main contributions are summarized as follows:

  • We propose an Oriented Attention Enable Network (OAENet) architecture for FER, which aggregates ROI awareness and an attention mechanism, ensuring sufficient utilization of both global and local features.

  • We propose a weighted mask that combines the facial landmarks and correlation coefficients, which proves effective in improving the attention on local regions.

  • Our method achieves state-of-the-art performance on several leading datasets such as CK+, RAF-DB, and AffectNet.
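The landmark-weighted mask described above can be sketched as follows. This is an illustrative assumption, not the paper's implementation: the landmark positions, per-region coefficients, Gaussian weighting, and all names are hypothetical; in the paper the coordinates come from a landmark detector and the coefficients from the region-test correlation analysis.

```python
import numpy as np

H, W = 48, 48
# Hypothetical landmark positions (row, col) and per-region coefficients.
landmarks = {"left_eye": (16, 14), "right_eye": (16, 34), "mouth": (36, 24)}
coeffs = {"left_eye": 0.8, "right_eye": 0.8, "mouth": 0.6}

def gaussian_bump(shape, center, sigma=4.0):
    """Smooth spatial weight peaked at a landmark location."""
    ys, xs = np.mgrid[: shape[0], : shape[1]]
    return np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))

def weighted_mask(shape):
    """Combine landmark bumps, scaled by correlation coefficients."""
    mask = np.zeros(shape)
    for name, yx in landmarks.items():
        mask += coeffs[name] * gaussian_bump(shape, yx)
    return mask / mask.max()  # normalize to [0, 1]

mask = weighted_mask((H, W))
face = np.ones((H, W))          # toy grayscale face
attended = face * mask          # elementwise product highlights key regions
print(mask.shape)               # → (48, 48)
```

The mask keeps a nonzero weight everywhere the bumps overlap, so peripheral regions are de-emphasized rather than discarded, matching the argument above against completely abandoning non-key areas.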

The remainder of this paper is organized as follows: Section 2 reviews related work. Our method is described in Section 3. Experimental results and analysis are discussed in Section 4, and conclusions are drawn in Section 5.

Section snippets

Related work

In recent years, many researchers have made efforts on facial expression recognition based on static images, and some effective methods have been proposed. We review previous work from two aspects related to ours, i.e., similar network architecture design (multi-task learning and network ensemble) and related techniques (attention mechanism).

Network architecture

As mentioned above, multitask and ensemble networks outperform a single end-to-end network on expression recognition tasks. Following our motivations, our network follows a pseudo-siamese structure [8]. An overview of our network is shown in Fig. 1. The first half of the network is divided into maintenance and attention branches. The global high-level semantic information of the original expression image is obtained through the maintenance branch, …

Experiments and results

The experiment consists of two parts. The first part presents our performance on the leading datasets and a comparison with state-of-the-art methods. The second part discusses the ablation study. We design two ablation studies, ablation I and ablation II. The former shows the impact of the region test, which demonstrates the influence of various parts of a face on the FER task. The latter studies the importance of the attention branch by removing it from the network.

Conclusions

In this article, we have introduced the attention mechanism to the facial expression recognition task through an oriented attention ensemble approach, to tackle the insufficient utilization of local information in FER. We have designed a pseudo-siamese network with maintenance and attention branches. We have calculated the oriented gradient of the face image to construct the weighted mask, which serves as the input to the attention branch. Furthermore, the attention mask of the facial …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 61872067, 61872068, and 61720106004, in part by the Department of Science & Technology of Sichuan Province under Grants 2019YFH0016 and 2020YFG0288, and in part by the 111 Project under Grant B17008.

References (40)

  • M. Berman et al.

The Lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • J. Cai et al.

    Island loss for learning discriminative features in facial expression recognition

    2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018)

    (2018)
  • H. Ding et al.

    Facenet2expnet: Regularizing a deep face recognition net for expression recognition

    2017 12th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017)

    (2017)
  • Y. Fan et al.

    Multi-region ensemble convolutional neural network for facial expression recognition

    International Conference on Artificial Neural Networks

    (2018)
  • J. Fu et al.

    Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • R. Hadsell et al.

    Dimensionality reduction by learning an invariant mapping

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2006)
  • B. Hasani et al.

    Facial expression recognition using enhanced deep 3D convolutional neural networks

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops

    (2017)
  • M. Herlihy

    A methodology for implementing highly concurrent data objects

    ACM Trans. Program. Lang. Syst.

    (1993)
  • H. Jung et al.

    Joint fine-tuning in deep neural networks for facial expression recognition

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • D.H. Kim et al.

    Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition

    IEEE Trans Affect Comput

    (2017)

    Zhengning Wang received the B.E. and Ph.D. degrees from the Southwest Jiaotong University (SWJTU), Chengdu, China, in 2000 and 2007, respectively. From 2009 to 2011, he worked as post-doctoral fellow in the second research institute of Civil Aviation Administration of China (CAAC), where he served as a project leader on the Remote Air Traffic Control Tower. He is currently an associate professor at the school of Electronic Engineering, University of Electronic Science and Technology of China (UESTC). From October 2014, he was visiting Media Communication Lab at University of Southern California (USC), USA as a visiting scholar for one year. His research interests are image and video processing, computer vision, multimedia communication systems and applications.

    Fanwei Zeng received his B.S. degree from the College of Physics and Electronic Science, Hubei Normal University, China in 2015. He is currently a master student in the College of Signal and Information Processing, University of Electronic Science and Technology of China (UESTC), China. His current research interests are image processing, computer vision and deep learning.

    Shuaicheng Liu received the Ph.D. and M.Sc. degrees from the National University of Singapore, Singapore, in 2014 and 2010, respectively, and the B.E. degree from Sichuan University, Chengdu, China, in 2008. In 2014, he joined the University of Electronic Science and Technology of China and is currently an Associate Professor with the Institute of Image Processing, School of Information and Communication Engineering, Chengdu, China. His research interests include computer vision and computer graphics.

    Bing Zeng (M’91-SM’13-F’16) received his BEng and MEng degrees in electronic engineering from University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 1983 and 1986, respectively, and his PhD degree in electrical engineering from Tampere University of Technology, Tampere, Finland, in 1991. He worked as a postdoctoral fellow at University of Toronto from September 1991 to July 1992 and as a Researcher at Concordia University from August 1992 to January 1993. He then joined the Hong Kong University of Science and Technology (HKUST). After 20 years of service at HKUST, he returned to UESTC in the summer of 2013 through China's 1000-Talent Scheme. At UESTC, he leads the Institute of Image Processing to work on image and video processing, 3D and multi-view video technology, and visual big data. During his tenure at HKUST and UESTC, he graduated more than 30 Master and PhD students, received about 20 research grants, filed 8 international patents, and published more than 260 papers. Three representative works are as follows: one paper on fast block motion estimation, published in IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) in 1994, has so far been SCI-cited more than 1000 times (Google-cited more than 2200 times) and currently stands at the 8th position among all papers published in this Transactions; one paper on smart padding for arbitrarily-shaped image blocks, published in IEEE TCSVT in 2001, led to a patent that has been successfully licensed to companies; and one paper on directional discrete cosine transform (DDCT), published in IEEE TCSVT in 2008, received the 2011 IEEE TCSVT Best Paper Award. He also received the best paper award at ChinaCom three times (2009 Xi'an, 2010 Beijing, and 2012 Kunming). He served as an Associate Editor for IEEE TCSVT for 8 years and received the Best Associate Editor Award in 2011. He was General Co-Chair of VCIP-2016 and PCM-2017.
He received a 2nd Class Natural Science Award (the first recipient) from Chinese Ministry of Education in 2014 and was elected as an IEEE Fellow in 2016 for contributions to image and video coding.
