Facial expression recognition from image based on hybrid features understanding☆
Introduction
Over the last two decades, human facial expression recognition (FER) has attracted significant attention and emerged as an important research area [38], [39]. Presently, automated FER has a large variety of applications, such as data-driven animation, neuromarketing, interactive games, sociable robotics, and many other human-computer interaction systems. Psychologists have developed different systems to describe and quantify facial behaviors. Among them, the Facial Action Coding System (FACS) developed by Ekman and Friesen [1] is the most popular. FACS was created to taxonomize human facial movements by their appearance on the face, and it remains a standard for systematically categorizing the physical expression of emotion. Recent works have already demonstrated good FER performance [2], [3], [11]. However, recognizing facial expressions with high accuracy and reliability is still challenging due to image variations caused by pose, illumination, age, and occlusion [4], [5], [6]. Algorithms for automated FER usually involve three main steps: face acquisition, feature extraction, and classification. The most important step in most FER systems is feature extraction, which represents the facial images as feature vectors [7], [23]; the features extracted from the input data significantly influence the final classification accuracy. For face representation, most existing work uses various hand-crafted features, including Gabor wavelet coefficients [8], histograms of Local Binary Patterns (LBP) [9], Histograms of Oriented Gradients (HOG) [10], and scale-invariant feature transform (SIFT) descriptors [43], or a combination of these features [12]. After obtaining the representation, various machine-learning algorithms can be applied to perform the classification task.
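As a concrete illustration of the hand-crafted route, an LBP descriptor such as the one cited above reduces a face image to a histogram of local binary codes. The toy implementation below is our own minimal sketch (basic 8-neighbour LBP, borders skipped), not the code used in the cited works:

```python
def lbp_histogram(image):
    """Histogram of 8-neighbour Local Binary Patterns.

    `image` is a 2-D list of grey values; border pixels are skipped.
    Returns a 256-bin histogram normalised to sum to 1.
    """
    # Neighbour offsets visited clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            centre = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Each neighbour contributes one bit: 1 if >= centre.
                if image[r + dr][c + dc] >= centre:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]
```

In a full FER pipeline, such a histogram (often computed per grid cell and concatenated) would form the feature vector handed to the classifier.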
Despite the success of traditional shallow features and generic image descriptors, recent developments in convolutional neural networks (CNNs) have demonstrated the strength of automatically learned features. CNNs have recently yielded impressive performance across a wide variety of competitive tasks and challenges [13], [14], [15]. Unlike traditional approaches where features are defined by hand, neural networks often improve visual-processing performance because of their ability to extract undefined features from the training database [42]. A CNN can serve as a classifier and is also an effective method for automatically learning deep, high-level features from image data.
The success of CNNs is attributed to their ability to learn rich high-level image representations, as opposed to the hand-designed low-level features used in classical image classification methods. However, training a CNN amounts to estimating millions of parameters and requires a very large number of annotated image samples. This currently hinders the application of CNNs to FER, where databases are limited, so overfitting becomes a serious problem. Although SIFT and other hand-crafted methods provide less accurate results than CNNs, they do not require an extensive database to generalize. Their limitation is that their modeling capacity is constrained by fixed transformations (filters) that stay the same for different sources of data.
In this paper, we propose a novel hybrid feature called CNN-SIFT that integrates the synergy of two complementary features, deep features extracted from a CNN and the shallow SIFT bag-of-features (SIFT-BoF), and then trains on these features and classifies expressions with Support Vector Machines (SVMs). The method is evaluated on a well-known facial expression database, the Extended Cohn-Kanade (CK+) database [16]. Moreover, we also performed experiments on less controlled scenarios using a cross-database configuration (training on the CK+ database and testing on the JAFFE database [17] and the MMI database [18]) to evaluate the generalization ability of our method. This generalization capability is crucial in real-world applications. Our experiments show that this combination has a strong capacity to capture informative features from different facial expression images.
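The fusion step described above can be sketched as follows. The exact scheme (normalisation, weighting, and which CNN layers are tapped) is not specified in this excerpt, so the sketch below assumes the common recipe of L2-normalising each descriptor separately and concatenating; `cnn_feat` and `sift_bof_feat` are hypothetical pre-computed vectors:

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def fuse_features(cnn_feat, sift_bof_feat):
    """Normalise each descriptor separately, then concatenate.

    The fused vector would then be fed to an SVM; per-descriptor
    normalisation keeps one feature type from dominating the margin.
    """
    return l2_normalize(cnn_feat) + l2_normalize(sift_bof_feat)
```

In practice the fused vectors and their expression labels would be passed to a linear or RBF-kernel SVM trainer.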
The main contributions of our work are listed as follows:
In this paper, a new hybrid feature representation for recognizing facial expressions from a single image frame is proposed, which combines SIFT with deep-learning features of different levels extracted from the CNN model.
The proposed method addresses the shortcoming of traditional methods that leverage only low-level features; we exploit deep-level features from our dataset for better facial expression recognition performance.
Section snippets
Related work
Various approaches have been proposed to recognize facial expressions, and significant progress has been made in this research area recently.
Prior to the emergence of deep learning algorithms, the majority of traditional feature-representation methods extracted hand-crafted shallow features locally from facial images. Shan [9] evaluated facial representations based on local statistical features called LBP. Experiments illustrated that LBP features perform robustly over a range of facial …
Multi-view features extraction
In this work, we use color, texture, and semantic features to characterize each region of each image. The feature extraction is detailed as follows:
- (1)
Color feature: we use color moments [22] to describe the color distribution of each atomic region. Color moments are widely used for image representation in classification and content-based image retrieval (CBIR). The procedure for extracting the color moments of each segmented region is as follows.
- (2)
Texture feature: we use the well-known histogram of …
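The color-moment descriptor mentioned in (1) is commonly defined as the per-channel mean, standard deviation, and (cube-rooted) third moment. A small self-contained sketch under that standard formulation, not the paper's exact procedure:

```python
import math

def color_moments(channel):
    """First three colour moments for one channel of a region.

    Returns (mean, standard deviation, signed cube root of the
    third central moment), computed over the channel's pixel values.
    """
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    std = math.sqrt(var)
    third = sum((x - mean) ** 3 for x in channel) / n
    # Signed cube root keeps the moment real when skew is negative.
    skew_root = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return mean, std, skew_root
```

Concatenating the three moments over the R, G, and B channels gives a 9-dimensional descriptor per region.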
Data sets and default setups
In the literature, there are four popular data sets for evaluating image quality, i.e., the CUHK [33], Photo.net [27], AVA, and LIVE-IQ [18]. A rough description of the four data sets is as follows:
- (1)
The CUHK contains 12,000 photos collected from DPChallenge.com. They have been labeled by ten independent viewers. Each photo is classified as highly aesthetic if more than eight viewers agree on the assessment. We use the standard split of training/test sets.
- (2)
The Photo.net consists of 3581 images.
Conclusions
Quality modeling is a useful technique in multimedia and computer vision [34], [35], [36], [44], [45]. In this paper, a quality model is proposed that optimally mimics human visual perception. Based on our designed multi-view active-learning algorithm, a few representative regions are selected to construct the gaze-shifting path. On this basis, a unified probabilistic model is proposed that encodes human perception of large-scale, high-quality training photos for determining the …
Conflict of interest
The authors declare that there is no conflict of interest.
References (45)
- Johnson, Martial Hebert, Using spin images for efficient object recognition in cluttered 3D scenes, IEEE Trans. Pattern Anal. Mach. Intell. (1999).
- Marco Carcassoni, Edwin R. Hancock, Correspondence Matching with Modal Clusters, 2003, pp. ...
- et al., Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Anal. Mach. Intell. (2002).
- et al., A Bayesian hierarchical model for learning natural scene categories.
- Modeling scenes with local descriptors and latent aspects.
- et al., Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes.
- et al., Spatial latent Dirichlet allocation, Proceedings of the Neural Information Processing Systems Conference (2007).
- et al., Latent Dirichlet allocation, J. Mach. Learn. Res. (2003).
- et al., Multiresolution histograms and their use for recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2004).
- et al., Beyond bags of features: spatial pyramid matching for recognizing natural scene categories.
- Advanced deep-learning techniques for salient and category-specific object detection: a survey, IEEE Signal Process. Mag.
- Dynamic learning from multiple examples for semantic object segmentation and search, Comput. Vision Image Understand.
- Region-based hierarchical image matching, Int. J. Comput. Vision.
- Generic model abstraction from examples, IEEE Trans. Pattern Anal. Mach. Intell.
- A review of co-saliency detection algorithms: fundamentals, applications, and challenges, ACM Trans. Intell. Syst. Technol. (TIST).
- Pictorial structures for object recognition, Int. J. Comput. Vision.
- Object-graphs for context-aware category discovery.
- Matching images under unstable segmentations.
- Image classification with segmentation graph kernels.
- On graph kernels: hardness results and efficient alternatives.
- Region correspondence by inexact attributed planar graph matching.
Cited by (36)
- A comprehensive survey on deep facial expression recognition: challenges, applications, and future guidelines, Alexandria Engineering Journal, 2023.
- A survey on facial emotion recognition techniques: a state-of-the-art literature review, Information Sciences, 2022.
- Contour and region harmonic features for sub-local facial expression recognition, Journal of Visual Communication and Image Representation, 2020.
- Facial expression recognition using human machine interaction and multi-modal visualization analysis for healthcare applications, Image and Vision Computing, 2020.
- AI-Based Facial Emotion Recognition, Lecture Notes in Networks and Systems, 2024.
☆ This article is part of the Special Issue on TIUSM.