Elsevier

Neurocomputing

Volume 267, 6 December 2017, Pages 385-395

An efficient unconstrained facial expression recognition algorithm based on Stack Binarized Auto-encoders and Binarized Neural Networks

https://doi.org/10.1016/j.neucom.2017.06.050

Highlights

  • The Binarized Auto-encoder is proposed for binary feature learning.

  • A transfer learning scheme is used to solve small-sample problem.

  • The Multi-scale Dense Local Binary Patterns feature is extended from LBP.

  • A real-world facial expression recognition system is proposed.

Abstract

Although deep learning has achieved good performance in many pattern recognition tasks, over-fitting remains a serious issue when training deep networks with large parameter sets on limited labeled data. In this work, Binarized Auto-encoders (BAEs) and Stacked Binarized Auto-encoders (Stacked BAEs) are proposed to learn domain knowledge from a large-scale unlabeled facial dataset. By transferring this knowledge to a supervised learning task based on Binarized Neural Networks (BNNs) with limited labeled data, the performance of the BNNs can be improved. A real-world facial expression recognition system is constructed by combining an unconstrained face normalization method, a variant of the LBP descriptor, BAEs and BNNs. Experimental results show that the whole system achieves good performance on the Static Facial Expressions in the Wild (SFEW) benchmark with minimal hardware requirements, i.e., low memory and computation costs.

Introduction

The research on facial expression was started by psychologists. Mehrabian et al. [1] suggested that the combined effect of simultaneous verbal, vocal and facial attitude communications is a weighted sum of their independent effects, with coefficients of 7%, 38% and 55%, respectively. Facial expression plays an important role in Human-Computer Interaction (HCI), affective computing, human behavior analysis, etc. The Facial Action Coding System (FACS) [2] and the Emotional Facial Action Coding System (EMFACS) [3] were proposed by Ekman and Friesen. FACS and EMFACS define a set of Action Units (AUs) associated with six basic emotions: anger, disgust, fear, happiness, sadness and surprise. The AUs and the six basic emotions have become the most commonly used expression labels for classification/detection tasks in machine learning.

In the early days, research on facial expression recognition (FER) was based on datasets recorded in laboratory environments using specialized recording devices. Some datasets contain posed expressions [4], while others contain spontaneous expressions [5]. Existing recognition methods can be divided into three types: geometry-based methods [6], [7], [8], appearance-based methods [9], [10], [11] and hybrid methods [12]. Geometry-based methods use the locations of facial landmarks, the distances between landmarks, the angles of the triangles in the face mesh, etc. as 1D features. Appearance-based methods use the global image, or local patches around the landmarks, to extract various 2D image features. Remarkable results have been achieved by these methods.

This work focuses on the unconstrained FER problem, which is currently an active research topic. Unconstrained faces exhibit variations in head pose, expression intensity, lighting conditions, backgrounds, occlusions and other distortions [13], [14], [15]. These conditions are very close to those in the real world, and unconstrained FER remains a challenging, unsolved problem.

Research trends show that FER methods based on Convolutional Neural Networks (CNNs) are becoming more and more popular. Take the state-of-the-art methods in the recent EmotiW Challenge [16] as examples: Alex-Net [17], [18], VGG-Net [19], [20], [21], GoogLeNet [21], [22] and various other CNNs [18], [20], [23] have been used for unconstrained FER. In these studies, external data such as the TFD dataset [24], the FER-2013 dataset [25] and the CASIA WebFace dataset [26] are also used [18], [20], [21], [23]. Decision fusion is widely used and improves their performance by 4–7% [18], [20], [21], [23].

The basic assumption of CNNs is that different regions of an image share the same local statistical properties. This assumption is not suitable for aligned faces. CNNs containing locally connected layers [27] have been proposed to alleviate this problem. However, locally connected layers have a large number of parameters and are impractical to train on small datasets. Besides, long training/test times and a large memory cost are common drawbacks of these CNN-based methods. Instead of a CNN, a novel fully connected neural network is introduced in this work. We believe that, with the help of face alignment and invariant features, a fully connected neural network is still a good choice for a wide range of face applications.

Recently, neural networks using binary weights or activations have attracted more and more attention. From a hardware perspective, binary weights and activations can accelerate both the forward propagation and the backward propagation (BP) of the networks, and can reduce the peak memory cost of training and testing. The noisy weights also act as a strong regularizer that prevents over-fitting. A well-designed algorithm can gain these advantages with an acceptable loss of performance. In this paper, Binarized Neural Networks (BNNs) [28] and Binarized Auto-encoders (BAEs) are used as the classifier and the feature extractor of an FER system.
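To make the memory argument concrete, the following sketch (not taken from the paper; plain NumPy with a hypothetical `binarize` helper) shows deterministic sign binarization and the 32× storage reduction obtained by packing ±1 weights into single bits:

```python
import numpy as np

def binarize(w):
    """Deterministic binarization: map real values to {-1, +1} via the sign."""
    return np.where(w >= 0, 1, -1).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)).astype(np.float32)  # real-valued weights
wb = binarize(w)                                        # binary weights

# With {-1, +1} weights and activations, a dot product reduces to
# XNOR + popcount on packed bits in hardware; here we only check that
# integer arithmetic suffices for the forward pass.
x = binarize(rng.standard_normal(128))
y = wb.astype(np.int32) @ x.astype(np.int32)

# 32-bit floats -> 1 bit per weight: a 32x reduction in weight storage.
packed = np.packbits(wb == 1)
print(w.nbytes, packed.nbytes)  # 131072 vs 4096 bytes
```

In hardware, the packed representation is what enables the XNOR-popcount trick mentioned above; the float-to-bit ratio is exactly 32:1.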

The main contributions of this work include:

  • Based on the novel Binarized Neural Networks (BNNs) [28], an unsupervised feature learning method called Binarized Auto-encoder (BAE) is proposed. BAEs can learn features on external large-scale unlabeled facial datasets, and improve the performance of the supervised learning task.

  • A low-level image feature called Multi-scale Dense Local Binary Patterns (MDLBP) is proposed for extracting discriminative information from face crops.

  • As far as we know, this paper is the first to combine a binary feature extractor, a binary unsupervised feature learner and a binary neural network into a real-world FER system. The system achieves good performance with minimal hardware requirements (i.e., lower memory and computation costs).
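Since MDLBP is described as an extension of the classic LBP descriptor, a minimal sketch of the basic 8-neighbor LBP code may be helpful (plain NumPy; the multi-scale and dense-sampling details of MDLBP are specific to the paper and are not reproduced here):

```python
import numpy as np

def lbp8(img):
    """Basic 3x3 Local Binary Patterns: each interior pixel is encoded
    by thresholding its 8 neighbors against the centre value."""
    img = np.asarray(img, dtype=np.int32)
    c = img[1:-1, 1:-1]  # centre pixels (borders are skipped)
    # neighbor offsets in clockwise order, starting at the top-left
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        n = img[dy:dy + c.shape[0], dx:dx + c.shape[1]]
        code |= (n >= c).astype(np.int32) << bit
    return code  # one 8-bit code (0..255) per interior pixel

face = np.arange(25).reshape(5, 5)  # toy 5x5 "image"
codes = lbp8(face)
# A 256-bin histogram of the codes is the usual LBP feature vector.
hist, _ = np.histogram(codes, bins=256, range=(0, 256))
```

The histogram of codes is what is typically concatenated over patches and scales to form a descriptor vector.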

The rest of the paper is organized as follows: Section 2 reviews the related work. In Section 3, the main method is proposed. The experiments and results are presented in Section 4. Section 5 gives the conclusion.

Section snippets

Binarized Neural Networks

Neural networks with binary weights or activations have been extensively studied in [28], [29], [30], [31], etc. Expectation Backpropagation (EBP) [29], [30] is an algorithm for training neural networks with binary weights and activations. It uses real weights + binary activations during training, and binary weights + binary activations during testing. BinaryConnect [31] is another method, which uses binary weights + real weights + real activations during training and binary weights +
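As a rough illustration of this family of training schemes, the toy sketch below (our own NumPy example, loosely modeled on BinaryConnect; the `train_step` function and the squared-error objective are assumptions, not the paper's algorithm) uses binarized weights in the forward and backward passes while updating clipped real-valued master weights:

```python
import numpy as np

def sign(w):
    """Binarize to {-1.0, +1.0}."""
    return np.where(w >= 0, 1.0, -1.0).astype(np.float32)

def train_step(W, x, t, lr=0.01):
    """One BinaryConnect-style step for a single linear layer.

    The forward/backward passes use the binarized weights; the gradient
    is then applied to the real-valued master weights (a straight-through
    estimator), which are kept clipped to [-1, 1]."""
    Wb = sign(W)                    # binary weights for this step
    y = Wb @ x                      # forward pass
    grad_y = y - t                  # d/dy of 0.5 * ||y - t||^2
    grad_W = np.outer(grad_y, x)    # gradient w.r.t. Wb, reused for W
    return np.clip(W - lr * grad_W, -1.0, 1.0)

rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, (4, 8)).astype(np.float32)  # real master weights
x = sign(rng.standard_normal(8))                   # binary input
t = np.zeros(4, dtype=np.float32)                  # toy regression target
for _ in range(100):
    W = train_step(W, x, t)
```

The clipping keeps the real weights in the range where the sign function can still flip them, which is the design choice that makes such training stable.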

The proposed method

An efficient recognition system for unconstrained expressions will be proposed in this section. Before going into the details, the general framework will be given. The pipeline of the system is illustrated in Fig. 1.

Several existing face preprocessing methods are integrated into the system for unconstrained face normalization. Once an image containing unconstrained faces is fed into the system, the bounding box and landmarks of each face are detected [35], [36]. The pose of the face

Experiments

In this section, the datasets used in this paper are introduced at first. Several experiments are designed for evaluating each component of the proposed method. Finally, the whole pipeline illustrated in Fig. 1 is evaluated and compared with the related work.

Some new notations are introduced in this section:

  • BAE-1 denotes a 1-layered BAE,

  • SBAE-2 denotes a 2-layered Stacked BAE,

  • BNN-L denotes an L-layered BNN,

  • BAE-BNN-L denotes a network containing a 1-layered BAE and an (L−1)-layered BNN,

  • SBAE-BNN-L

Conclusion

This work is focused on solving the real-world FER task. It is the first to combine a binary feature extractor, a binary unsupervised feature learner and a binary neural network into a real-world FER system. Firstly, an unconstrained face normalization method is proposed by integrating several existing facial image preprocessing algorithms. Secondly, a low-level image feature called MDLBP is proposed for extracting discriminative information from face crops. Thirdly, BAEs are proposed to learn

Acknowledgments

This work is partially supported by National Natural Science Foundation of China under Grant Nos. 61375007, 61373063, 61233011, 91420201, 61472187 and by National Basic Research Program of China under Grant No. 2014CB349303.

Wenyun Sun received the B.S. degree in Computer Science and Technology and the M.S. degree in Pattern Recognition and Intelligent System from Jiangsu University of Science and Technology, Zhenjiang, China in 2009 and 2012, respectively. He is currently pursuing the Ph.D. degree in Pattern Recognition and Intelligent Systems at Nanjing University of Science and Technology, Nanjing, China.

References (52)

  • X. Zhang et al.

    BP4D-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database

    Image Vis. Comput.

    (2014)
  • W. Liu et al.

    HSAE: a Hessian regularized sparse auto-encoders

    Neurocomputing

    (2016)
  • C. Sagonas et al.

    300 faces in-the-wild challenge: database and results

    Image Vis. Comput.

    (2016)
  • A. Mehrabian et al.

    Inference of attitudes from nonverbal communication in two channels

    J. Consult. Psychol.

    (1967)
  • P. Ekman et al.

Facial Action Coding System (FACS): Manual

    (1978)
  • W. Friesen et al.

    EMFACS-7: Emotional Facial Action Coding System

    Technical Report

    (1983)
  • P. Lucey et al.

The extended Cohn–Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression

    Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

    (2010)
  • Z. Zeng et al.

    A survey of affect recognition methods: audio, visual, and spontaneous expressions

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • A. Asthana et al.

    Evaluating AAM fitting methods for facial expression recognition

    Proceedings of the Third International Conference on Affective Computing and Intelligent Interaction and Workshops

    (2009)
  • A. Dhall et al.

    Facial expression based automatic album creation

    Proceedings of the International Conference on Neural Information Processing

    (2010)
  • J. Whitehill et al.

    Toward practical smile detection

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)
  • A. Dhall et al.

    Emotion recognition using PHOG and LPQ features

    Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition and Workshops (FG 2011)

    (2011)
  • L. Xu et al.

    Automatic facial expression recognition using bags of motion words

    Proceedings of the British Machine Vision Conference

    (2010)
  • S. Lucey, I. Matthews, C. Hu, Z. Ambadar, F. De la Torre, J. Cohn, AAM derived face representations for robust facial...
  • A. Dhall et al.

    Static facial expression analysis in tough conditions: data, evaluation protocol and benchmark

    Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops)

    (2011)
  • A. Dhall et al.

    Collecting large, richly annotated facial-expression databases from movies

    IEEE MultiMedia

    (2012)
  • C. Fabian Benitez-Quiroz et al.

    EmotioNet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • A. Dhall et al.

    Video and image based emotion recognition challenges in the wild: EmotiW 2015

    Proceedings of the International Conference on Multimodal Interaction

    (2015)
  • A. Krizhevsky et al.

    ImageNet classification with deep convolutional neural networks

    Proceedings of the Advances in Neural Information Processing Systems

    (2012)
  • S.E. Kahou et al.

    Combining modality specific deep neural networks for emotion recognition in video

    Proceedings of the Fifteenth International Conference on Multimodal Interaction

    (2013)
  • K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556. URL...
  • S. Ebrahimi Kahou et al.

    Recurrent neural networks for emotion recognition in video

    Proceedings of the International Conference on Multimodal Interaction

    (2015)
  • G. Levi et al.

    Emotion recognition in the wild via convolutional neural networks and mapped binary patterns

    Proceedings of the International Conference on Multimodal Interaction

    (2015)
  • C. Szegedy et al.

    Going deeper with convolutions

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2015)
  • B.-K. Kim et al.

    Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition

    Proceedings of the International Conference on Multimodal Interaction

    (2015)
  • J.M. Susskind et al.

    The Toronto Face Database

    Technical Report

    (2010)


Haitao Zhao received his Ph.D. degree in pattern recognition and intelligent system from Nanjing University of Science and Technology, Nanjing, China in 2003. Now he is a professor at East China University of Science and Technology, Shanghai, China. His current interests are in the areas of pattern recognition, machine learning and computer vision.

Zhong Jin received the B.S. degree in mathematics, M.S. degree in applied mathematics and the Ph.D. degree in pattern recognition and intelligent system from Nanjing University of Science and Technology, Nanjing, China in 1982, 1984 and 1999, respectively. His current interests are in the areas of pattern recognition and face recognition.
