Facial expression recognition from image based on hybrid features understanding

https://doi.org/10.1016/j.jvcir.2018.11.010

Abstract

Facial expression recognition (FER) plays an important role in human-computer interaction applications. Given the wide use of convolutional neural networks (CNNs) in automatic video and image classification systems, higher-level features can be learned automatically by hierarchical neural networks from big data. However, training a CNN requires a large amount of data for adequate generalization, whereas the scale-invariant feature transform (SIFT) does not need large training sets to generate useful features. In this paper, we propose a new hybrid feature representation for the recognition of facial expressions from a single image frame that combines SIFT with deep learning features extracted from different levels of a CNN model; the combined features are then classified by support vector machines (SVMs). The performance of the proposed method has been validated on the public CK+ database. To evaluate the generalization ability of our method, we also performed an experiment in a cross-database environment. Experimental results show that the proposed approach achieves better classification rates than state-of-the-art CNN methods, which indicates the considerable potential of combining shallow features with deep learning features.

Introduction

Over the last two decades, human facial expression recognition (FER) has attracted significant attention and emerged as an important research area [38], [39]. Automated FER now has a large variety of applications, such as data-driven animation, neuromarketing, interactive games, sociable robotics, and many other human-computer interaction systems. Psychologists have developed different systems to describe and quantify facial behaviors. Among them, the facial action coding system (FACS) developed by Ekman and Friesen [1] is the most popular. FACS was created to taxonomize human facial movements by their appearance on the face, and it remains a standard for categorizing the physical expression of emotion systematically. Recent works have already demonstrated good FER performance [2], [3], [11]. However, recognizing facial expressions with high accuracy and reliability is still a challenging problem due to image variations caused by pose, illumination, age, and occlusion [4], [5], [6]. Algorithms for automated FER usually involve three main steps: face acquisition, feature extraction, and classification. The most important step in most human FER systems is feature extraction, which aims to represent facial images as feature vectors [7], [23]; the features extracted from the input data significantly influence the final classification accuracy. For face representation, most existing work utilizes various hand-crafted features, including Gabor wavelet coefficients [8], histograms of local binary patterns (LBP) [9], histograms of oriented gradients (HOG) [10], and scale-invariant feature transform (SIFT) descriptors [43], or a combination of these features [12]. After obtaining the representation, various machine learning algorithms can be applied to perform the classification task.
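As a concrete illustration of one of these hand-crafted descriptors, the following is a minimal sketch of a uniform-LBP histogram computed with scikit-image. The radius, the number of sampling points, and the normalization are illustrative assumptions, not the settings used in the cited works.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_face, points=8, radius=1):
    """Normalized uniform-LBP histogram for a grayscale face image."""
    lbp = local_binary_pattern(gray_face, P=points, R=radius, method="uniform")
    n_bins = points + 2  # P+1 uniform patterns plus one bin for non-uniform codes
    hist, _ = np.histogram(lbp.ravel(), bins=n_bins, range=(0, n_bins))
    return hist / max(hist.sum(), 1)  # normalize to sum to 1

# Usage: feat = lbp_histogram(face)  # face: 2-D uint8 array
```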

Despite the success of traditional shallow features and generic image descriptors, recent developments in convolutional neural networks (CNNs) have demonstrated the significant success of automatically learned features. CNN techniques have recently yielded impressive performance across a wide variety of competitive tasks and challenges [13], [14], [15]. Unlike traditional approaches, where features are defined by hand, neural networks often improve visual processing tasks because of their ability to extract features from the training database that were never explicitly defined [42]. A CNN can be used as a classifier and is also an effective method for learning deep features automatically from image data at multiple levels.
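To make this idea concrete, here is a hedged sketch of using a pretrained CNN as a feature extractor. The paper does not specify this architecture: VGG-16, the ImageNet weights, and the choice of the first fully connected layer are assumptions for illustration only.

```python
import torch
from torchvision import models, transforms

# Load a pretrained VGG-16 once, in inference mode (illustrative choice).
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def deep_features(pil_face):
    """4096-D activations of the first fully connected layer."""
    x = preprocess(pil_face).unsqueeze(0)   # add batch dimension
    with torch.no_grad():
        x = vgg.features(x)                 # convolutional stages
        x = vgg.avgpool(x)
        x = torch.flatten(x, 1)
        x = vgg.classifier[:2](x)           # fc1 + ReLU only
    return x.squeeze(0).numpy()
```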

The success of CNNs is attributed to their ability to learn rich, high-level image representations, as opposed to the hand-designed low-level features used in earlier image classification methods. However, training a CNN amounts to estimating millions of parameters and therefore requires a very large number of annotated image samples. This currently hinders the application of CNNs to FER, where databases are limited, and makes overfitting a serious problem. Although SIFT and other hand-crafted methods provide less accurate results than CNNs, they do not require extensive databases for generalization. Their limitation, however, is that their modeling capacity is constrained by fixed transformations (filters) that stay the same for different sources of data.

In this paper, we propose a novel hybrid feature called CNN-SIFT that integrates the synergy of two complementary representations: deep learning features extracted from a CNN and the shallow SIFT bag-of-features (BoF) representation; the combined features are then used to train support vector machines (SVMs) that classify the expressions. The method is evaluated on a well-known facial expression database, the Extended Cohn-Kanade (CK+) database [16]. Moreover, we also performed experiments in less controlled scenarios using a cross-database configuration (training on the CK+ database and testing on the JAFFE database [17] and the MMI database [18]) to evaluate the generalization ability of our method; this generalization capability is crucial in real-world applications. Our experiments show that the combined method has a strong capacity to capture informative features from different facial expression images.
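The following is a minimal end-to-end sketch of the hybrid pipeline described above: SIFT descriptors quantized into a bag-of-features histogram, concatenated with deep CNN features, and classified with an SVM. The codebook size, the linear kernel, and the deep_features() helper (from the earlier CNN sketch) are illustrative assumptions, not the paper's exact configuration.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

sift = cv2.SIFT_create()

def sift_descriptors(gray_face):
    """128-D SIFT descriptors for one face image (possibly empty)."""
    _, desc = sift.detectAndCompute(gray_face, None)
    return desc if desc is not None else np.zeros((0, 128), np.float32)

def build_codebook(pooled_descriptors, k=128):
    """Cluster SIFT descriptors from the training set into k visual words."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled_descriptors)

def bof_histogram(desc, codebook):
    """Quantize descriptors against the codebook into a normalized histogram."""
    k = codebook.n_clusters
    if len(desc) == 0:
        return np.zeros(k)
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

def hybrid_feature(gray_face, pil_face, codebook):
    """Concatenate the shallow SIFT-BoF and deep CNN representations."""
    shallow = bof_histogram(sift_descriptors(gray_face), codebook)
    deep = deep_features(pil_face)          # from the earlier CNN sketch
    return np.concatenate([shallow, deep])

# Training and prediction with an SVM, e.g.:
#   clf = SVC(kernel="linear").fit(X_train, y_train)
#   y_pred = clf.predict(X_test)
```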

The main contributions of our work are listed as follows:

A new hybrid feature representation for the recognition of facial expressions from a single image frame is proposed, which combines SIFT with deep learning features extracted from different levels of a CNN model.

The proposed method addresses a shortcoming of traditional methods that rely on low-level features: we exploit deep-level features learned from the dataset for better facial expression recognition performance.

Section snippets

Related work

Various approaches have been proposed to recognize facial expressions, and significant progress has been made in this research area in recent years.

Prior to the emergence of deep learning algorithms, the majority of feature representation methods extracted hand-crafted shallow features locally from facial images. Shan [9] evaluated facial representations based on local statistical features called LBP; experiments illustrated that LBP features perform robustly over a range of facial

Multi-view features extraction

In this work, we use color, texture, and semantic features to characterize each region of each image. We detail the feature extraction as follows:

  • (1)

    Color feature: we use color moments [22] to describe the color distribution of each atomic region. Color moments are widely used for image representation in classification and content-based image retrieval (CBIR). The procedure for extracting the color moments of each segmented region is given as follows (a code sketch appears after this list).

  • (2)

    Texture feature: we use the well-known histogram of
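For item (1) above, here is a short sketch of the color-moment descriptor: the first three moments (mean, standard deviation, skewness) of each color channel, in the spirit of Stricker and Orengo [22]. Operating on a rectangular region array rather than the paper's atomic segmented regions is an assumption made here for simplicity.

```python
import numpy as np

def color_moments(region):
    """region: H x W x 3 array; returns a 9-D vector of per-channel moments."""
    feats = []
    for c in range(3):
        channel = region[..., c].astype(float).ravel()
        mean = channel.mean()
        std = channel.std()
        # cube root of the third central moment: a signed skewness measure
        skew = np.cbrt(((channel - mean) ** 3).mean())
        feats.extend([mean, std, skew])
    return np.array(feats)
```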

Data sets and default setups

In the literature, there are four popular data sets for evaluating image quality: CUHK [33], Photo.net [27], AVA, and LIVE-IQ [18]. A rough description of the four data sets is as follows:

  • (1)

    The CUHK data set contains 12,000 photos collected from DPChallenge.com, each labeled by ten independent viewers. A photo is classified as highly aesthetic if more than eight viewers agree on the assessment. We use the standard training/test split.

  • (2)

    The Photo.net data set consists of 3581 images.

Conclusions

Quality modeling is a useful technique in multimedia and computer vision [34], [35], [36], [44], [45]. In this paper, a quality model is proposed which optimally mimics human visual perception. Based on our designed multi-view active learning algorithm, a few representative regions are selected for constructing the gaze shifting path. Based on this, a unified probabilistic model is proposed which encodes the human perception of large-scale high quality training photos for determining the

Conflict of interest

The authors declare that there is no conflict of interest.

References (45)

  • A.E. Johnson et al.

    Using spin images for efficient object recognition in cluttered 3D scenes

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1999)
  • M. Carcassoni, E.R. Hancock, Correspondence matching with modal clusters, 2003, pp....
  • S. Belongie et al.

    Shape matching and object recognition using shape contexts

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • L. Fei-Fei et al.

    A Bayesian hierarchical model for learning natural scene categories

  • P. Quelhas

    Modeling scenes with local descriptors and latent aspects

  • L. Cao et al.

    Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes

  • X. Wang et al.

    Spatial latent Dirichlet allocation

    Proceedings of Neural Information Processing Systems Conference

    (2007)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    J. Mach. Learn. Res.

    (2003)
  • E. Hadjidemetriou et al.

    Multiresolution histograms and their use for recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2004)
  • S. Lazebnik et al.

    Beyond bags of features: spatial pyramid matching for recognizing natural scene categories

  • J. Han et al.

    Advanced deep-learning techniques for salient and category-specific object detection: a survey

    IEEE Signal Process. Mag.

    (2018)
  • Y. Xu et al.

    Dynamic learning from multiple examples for semantic object segmentation and search

    Comput. Vis. Image Underst.

    (2004)
  • S. Todorovic et al.

    Region-based hierarchical image matching

    Int. J. Comput. Vision

    (2007)
  • Y. Keselman et al.

    Generic model abstraction from examples

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2005)
  • D. Zhang et al.

    A review of co-saliency detection algorithms: fundamentals, applications, and challenges

    ACM Trans. Intell. Syst. Technol. (TIST)

    (2018)
  • P.F. Felzenszwalb et al.

    Pictorial structures for object recognition

    Int. J. Comput. Vision

    (2005)
  • Y.J. Lee et al.

    Object-graphs for context-aware category discovery

  • V. Hedau et al.

    Matching images under unstable segmentations

  • Z. Harchaoui et al.

    Image classification with segmentation graph kernels

  • T. Gaertner et al.

    On graph kernels: hardness results and efficient alternatives

  • C. Wang et al.

    Region correspondence by inexact attributed planar graph matching

  • M.A. Stricker, M. Orengo, Similarity of color images, in: Storage and Retrieval for Image and Video Databases (1995)...

This article is part of the Special Issue on TIUSM.