Pattern Recognition

Volume 56, August 2016, Pages 100–115

Differential components of discriminative 2D Gaussian–Hermite moments for recognition of facial expressions

https://doi.org/10.1016/j.patcog.2016.03.006

Highlights

  • A novel facial expression recognition algorithm using 2D Gaussian–Hermite moments.

  • A new approach for the discriminative selection of moments as features of expression.

  • A new subspace to estimate differential components of the moments as expressive features.

  • Experiments on challenging datasets having posed, spontaneous, and wild expressions.

  • Results show that the proposed method outperforms existing and similar methods.

Abstract

This paper deals with a new expression recognition method that represents facial images in terms of higher-order two-dimensional orthogonal Gaussian–Hermite moments (GHMs) and their geometric invariants. Only the moments having high discrimination power are selected as a set of features for expressions. To obtain the differentially expressive components of the moments, the discriminative GHMs are projected onto a new expression-invariant subspace constructed using the correlations among the neutral faces. Features obtained from the discriminative moments and the differentially expressive components of the moments are used to recognize an expression with the well-known support vector machine classifier. Experimental results are presented for commonly used databases such as CK-AUC, FRGC, and MMI, which contain posed or spontaneous expressions, as well as the GENKI database, which contains expressions in-the-wild. Experiments on mutually exclusive subjects reveal that the expression recognition performance of the proposed method is significantly better than that of existing or similar methods, which use local or patch-based high-dimensional binary patterns, directional number patterns generated from Gaussian derivatives, or Gabor- and other moment-based features.

Introduction

Facial expression is one of the most powerful means for humans to communicate their emotions, intentions, and opinions to each other. Psychological studies reveal that while the verbal content of a conversation accounts for only 7% of its overall impact and vocal intonation contributes 38%, facial expressions carry the largest share of the conveyed information, i.e., 55% [1]. Thus, automated facial expression recognition (FER) has been a highly active field of research in cognitive and behavioral science, with important applications in lie detection, intelligent communication in social media, e-commerce, and multimodal human–computer interfaces [2].

Facial expressions are outcomes of the complex interplay of a person's emotions. For example, the expression at a given instant can be formed from mixed feelings, such as a combination of happiness and surprise or of disgust and contempt. Even a basic emotion can have multiple subclasses; the emotional state ‘joy’ may include cheerfulness, zest, contentment, pride, optimism, enthrallment, and relief [3]. The dimensional approach defines an emotional state along three continuous dimensions, viz., valence, arousal, and dominance [4]. Studies reveal that basic emotions can be discrete in nature and that those emotions correspond to universal facial expressions across all cultures. Hence, most automatic FER algorithms attempt to map expressions directly into one of the six basic classes, viz., disgust, sadness, fear, anger, surprise, and happiness, introduced by Ekman and Friesen [5].

Although the human visual system (HVS) is very robust in recognizing identity from face images, the same is not true for the interpretation of facial expressions. According to Bassili [6], a trained observer can correctly classify faces showing the six basic emotions with an average accuracy of 87%, and the results vary with the familiarity of the face, the personality of the observed person, the attention given to the face, and even non-visual cues, e.g., the context of observation [7] and the age of the observer [8]. The challenges to widespread deployment of automatic FER systems include uncontrolled imaging environments as well as the physiological and appearance-based variabilities of face images. Preprocessing steps such as registration algorithms or geometric transformations can be used to counter the effects of variations in head orientation and face pose. Other imaging variabilities, such as background clutter and occlusion, can be reduced by segmentation algorithms, and uneven illumination by enhancement algorithms. In practice, there exists a high degree of variability in facial characteristics across people, such as variations due to age, illness, gender, or race, and those due to appearance, such as facial hair, makeup, beards, and glasses. The preprocessing algorithms, the features representing the deformations of faces caused by expressions, and finally the classification of the features play significant roles in addressing the challenges of designing an automatic FER system. In this context, studies focus on two major types of expression categories: posed or deliberate, and spontaneous or authentic [3]. In the posed scenario, a subject produces artificial expressions on demand through a guided mechanism. In the spontaneous scenario, on the other hand, subjects show their day-to-day, on-the-spot expressions without any guidance. Psychologists have shown that spontaneous expressions differ significantly from posed expressions in their appearance, timing, and extent of exaggeration. Recommendations have been made to integrate facial features with body language or speech intonation to identify spontaneous expressions [9]. In addition to expression analysis in controlled environments, the more challenging problem of analyzing expressions in uncontrolled environments, i.e., in-the-wild, is also gaining increasing research interest. Methods that use densely sampled, high-dimensional facial image features are often adopted to recognize expressions in-the-wild [10]. Nevertheless, the development of successful and computationally efficient expression recognition systems based on unique facial features requires that the method recognize subtle and genuine expressions in addition to posed ones, not only in controlled laboratory settings but also in-the-wild.

There are two major approaches to capturing facial deformations – holistic or appearance-based [11] and model- or geometric-based [12] – wherein the former mainly focuses on extracting features from pixel textures and the latter on the shape or dimensions of faces. The geometric approach requires the selection of fiducial points that are localized at strategic neighboring pixels of facial images. Distances among the key points and texture characteristics of their local neighboring regions are commonly used as features of expressions. In video sequences, the changes of facial muscles in the vicinity of certain key points, often referred to as facial action units (FAUs), are also used to determine facial expressions [13], [14]. The activities of FAUs, or the changes of geometric features over time that occur with variations of expression, are then measured using a suitable graph matching algorithm [15]. The appearance-based approach, on the other hand, measures facial actions on the full field of an image or video instead of certain local neighboring points or regions. In general, such an approach makes use of the changes of gray-levels of facial images with respect to the neutral image that are not caused by non-expression-related permanent factors, e.g., wrinkles due to aging [16].

Numerous deformation-based features generated from the motions of pixels in a video sequence have been used for developing automatic FER systems. For example, Cohen et al. [17] introduced a set of motions as features of expression in terms of Bézier volume control parameters estimated from the surface patches of the face model. The avatar emotion face model has been created as a single representative of a video sequence by condensing the frames using the scale invariant feature transform (SIFT) [18]. The complex spatio-temporal activities of the facial parts have been modelled by the interval temporal Bayesian network in [19] and by the interaction of FAUs through a restricted Boltzmann machine in [20]. Recent surveys on spatio-temporal analysis of facial affects can be found in [21] and [22]. In order to accommodate the description of facial activities, changes in the depth of two-dimensional (2D) image pixels or in the surface of 3D face models have also been used [23]. It is to be noted that the use of 3D face models is limited by the stereo imaging process, which results in unequal resolutions of intensity and depth when laser beams are used, or applies only in a constrained environment where multiview images are captured with known camera parameters. Hence, research is still sought on the development of low-complexity 2D image-based FER systems for real-time and widespread deployment.

In a 2D image-based FER system that uses the geometric approach, the extraction of fiducial points on the neutral image requires high precision in edge detection and projection analysis, even in controlled environments. For example, Lyons et al. [24] have used the computationally expensive elastic graph matching algorithm for registration of the face image using the fiducial points, and a suitable discriminant analysis of the Gabor transform coefficients of pixels selected by grid nodes to classify the expressions. In general, the grid-based localized patterns that are captured to obtain features of facial expression are highly sensitive to noise and unwanted facial structures, and hence these methods require precise alignment of multiple nodes to estimate the facial actions. Other model-based facial expression algorithms include the active shape model and the active appearance model, wherein scale- and pose-invariant landmarks are fitted to the faces and illumination-invariant textures of images are estimated iteratively using statistical features [25].

The 2D image-based FER systems that follow the appearance-based approach emphasize preserving the original images as much as possible and allow the classifier to discover the relevant features in the images, a mechanism similar to that of the HVS. In the conventional holistic representation, the features used for facial expressions are obtained using various projection-based methods, such as Eigenspaces [26], 2D principal component analysis (PCA) [27], independent component analysis, Fisher discriminant analysis (FDA) along with AsymmetryFace [11], kernel PCA-FDA, matrix-based canonical correlation analysis, non-negative matrix factorization, multiple manifolds [28], and mixture covariance analysis [29] applied to the whole face. Features are also extracted from local regions of facial images because the deformations are characterized by changes in both the shapes and textures of images. A number of appearance-based FER algorithms have been proposed that partition the entire face uniformly or selectively and then extract suitable deformation-based features from the patches. As an example, the SIFT and the pyramid of histograms of oriented gradients (HOG) of four subregions, namely the forehead, eyes-eyebrows, nose, and mouth of the face images, have been used for designing the codewords of facial expression [30]. Shan et al. [31] used the histograms of local binary patterns (LBPs) of rectangular local image regions as texture-based features of facial expressions (see the sketch after this paragraph). In [32], Gabor coefficients at three scales and five orientations of the upper, lower, and whole parts of a face image are fused using a rule based on the presence of FAUs due to expressions. Scale and positional distributions of salient patch-based Gabor features have been used for expression classification in [33]. In [34], LBP-based features of active patches selected from the eyes, nose, and lips have been used for expression recognition. Other texture-based features of local neighboring regions of facial images used for expression recognition, in either controlled or uncontrolled environments, include the second-order local autocorrelation [35], the local directional number (LDN) pattern [36], the local directional pattern (LDiP) variance [37], the high-dimensional binary feature (HDBF) [10], multiscale Gaussian derivatives [38], and transformed coefficients of images obtained from Haar-like or log-Gabor filters [39]. To benefit from both the appearance- and geometric-based features, the facial textures and landmark modalities are combined using structured regularization in [40]. Decisions on the expressions obtained from the SIFT, HOG, and LBP-based features of 3D facial images have been fused in [41]. Ji and Idrissi [42] have recommended an extended LBP and the first three geometric moments obtained from certain vertical-time slice-images of a video sequence as features of facial expression.
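As a concrete illustration of the texture descriptors surveyed above, the following is a minimal sketch of a basic 8-neighbor LBP histogram of a grayscale patch, in the spirit of Shan et al. [31]; the neighborhood ordering, binning, and normalization here are illustrative assumptions rather than the settings of any cited work.

```python
import numpy as np

def lbp_histogram(patch):
    """Basic 8-neighbor local binary pattern histogram of a grayscale
    patch: each interior pixel is encoded by thresholding its 8
    neighbors against its own intensity."""
    c = patch[1:-1, 1:-1]
    neighbors = [patch[0:-2, 0:-2], patch[0:-2, 1:-1], patch[0:-2, 2:],
                 patch[1:-1, 2:],   patch[2:, 2:],     patch[2:, 1:-1],
                 patch[2:, 0:-2],   patch[1:-1, 0:-2]]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbors):           # build the 8-bit code
        code |= (n >= c).astype(np.uint8) << bit
    hist, _ = np.histogram(code, bins=256, range=(0, 256))
    return hist / hist.sum()                      # normalized 256-bin descriptor
```

In methods such as [31], histograms of this kind are computed on a grid of patches and concatenated into a single feature vector.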

Most appearance-based features of facial deformations pose challenges in distinguishing between expressions when scaling, rotation, shift, or tilt remains in the facial appearance. Since certain affine transformations of the geometric and orthogonal moments, such as the Gaussian–Hermite, Krawtchouk, Tchebichef, and Zernike moments, have been shown to possess invariance under scaling, shifting, and rotation of a pattern, a few of these moments have also been used to obtain features for biometric security systems, such as the Gaussian–Hermite moments (GHMs) for fingerprint recognition in [43] and iris recognition in [44]. The geometric and orthogonal moments have also been used for recognition of faces as well as facial expressions. For instance, geometric moments up to order six and 2D Tchebichef moments up to order twenty obtained from face images have been used as features for expression recognition in [45] and face recognition in [46], respectively. In [47], the 2D Tchebichef–Krawtchouk moments of fixed orders are chosen as features for face recognition such that the length of the features is one fourth of the image size. The 2D Zernike moments of the face image up to a maximum order of 12 are considered as features for face recognition in [48] and expression recognition in [49]. In [50], the intraclass correlation coefficients of the 2D GHMs are employed to construct the discriminatory feature set for face recognition. The 1D GHMs of the mesh nodes of the facial surface up to order four [51], or those of the prescribed bending invariants of the surface up to order two [52], are used as features for 3D face recognition. In [53], features obtained from the variance decomposition-based discriminative selection of Krawtchouk moments (KCMs) have been shown to perform better for face recognition than those obtained from the traditional LDA-based projection of the first few orders of orthogonal moments.

A number of classification algorithms, such as template matching [31], nearest neighbor [27], the support vector machine (SVM) [31], banks of SVMs [54], neural networks [32], linear programming, and Bayesian dynamic networks or hidden Markov models [55], have been employed for recognizing facial expressions. To handle small numbers of samples in classification, methods such as the augmented variance ratio, bagging and re-sampling in random subspaces [11], AdaBoost [31], mutual information, weighted saliency maps, and genetic algorithms [28] have been introduced to improve discriminative selection by reducing the redundancy and maximizing the relevancy of the features.
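As an illustration of the variance-ratio style of discriminative feature selection mentioned above, a minimal sketch follows; the Fisher-style criterion and the number of retained features are assumptions made for illustration, not the procedure of any cited work.

```python
import numpy as np

def select_discriminative(features, labels, keep=200):
    """Rank scalar features (e.g., individual moments) by a Fisher-style
    between-class to within-class variance ratio and keep the top ones.
    `features` is (n_samples, n_features); `labels` holds class ids."""
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in np.unique(labels):
        fc = features[labels == c]
        between += len(fc) * (fc.mean(axis=0) - overall_mean) ** 2
        within += ((fc - fc.mean(axis=0)) ** 2).sum(axis=0)
    ratio = between / np.maximum(within, 1e-12)   # guard against zero variance
    return np.argsort(ratio)[::-1][:keep]         # indices of top-ranked features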

To address the problem of isolating local deformations in face images due to expressions, traditional appearance-based methods use projection- or texture-based features. Orthogonal moments have also been employed as features of expression with a view to capturing the local spatial dynamics of the images. However, these moment-based methods fall short in capturing discriminative information, specifically because the number of moments used to represent features for recognition of expressions in these algorithms is chosen heuristically. To overcome these drawbacks, a thorough mathematical and experimental study is necessary to develop an efficient orthogonal moment-based FER system. In this study, the 2D GHMs and their geometric invariants are considered for developing descriptors of facial expressions. A question that naturally arises is why the GHMs are preferred to other existing orthogonal moments as features for facial expressions.

In visual signal processing, the GHMs are popular among the various moments because the width of the Gaussian weight function of the Hermite polynomial expansion provides flexibility in isolating visual features, much as the HVS does. In particular, the GHMs can be interpreted as linear combinations of the derivatives of a signal filtered by a Gaussian filter [56]. In [57], a multi-resolution spectral analysis has shown that the quality factor of the frequency window of the Gaussian–Hermite basis functions is better than that of other moments such as the geometric or Legendre moments. Further, the zero-crossings of the Gaussian–Hermite basis functions are distributed more evenly than those of other orthogonal functions such as the Legendre, discrete Tchebichef, and Krawtchouk functions [58]. It is due to these properties of smoothness and uniformity of the distribution of zero-crossings of the basis functions that image reconstruction and noise-robust feature representation by the GHMs are better than those of the commonly used orthogonal moments [59]. Thus, the GHM is preferred to other orthogonal moments in the analysis of facial expressions.

One of the major objectives of this paper is to present a low-complexity FER method in which the discriminative GHMs are used as features of expressions. The key contribution of this paper is the introduction of an expression-invariant subspace generated from the GHMs of expression-free neutral images, from which subtle facial deformations can be isolated and used for improving the accuracy of expression recognition. The effectiveness of constructing features from both the discriminatory set of moments and the differentially expressive components of the moments is validated through comprehensive experiments. Representative results shown in this paper include those obtained from well-known expression databases such as the Cohn–Kanade Action Unit Coded (CK-AUC) [60], the Facial Recognition Grand Challenge (FRGC) [61], and the M&M Initiative (MMI) [62] databases, which contain posed or spontaneous expressions, as well as a database with expressions in-the-wild, namely GENKI [63].
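The exact construction of the expression-invariant subspace is given in Section 3; as a rough sketch of the idea, suppose the subspace is spanned by the dominant directions of the moment vectors of neutral faces (an assumption made here for illustration). The differential component of an expressive face can then be taken as the residual left after projecting its moment vector onto that neutral subspace.

```python
import numpy as np

def neutral_subspace(neutral_feats, dim):
    """Orthonormal basis of the dominant directions of the moment
    vectors of neutral faces (assumed construction, see Section 3)."""
    mean = neutral_feats.mean(axis=0)
    # right singular vectors of the centered data span the subspace
    _, _, vt = np.linalg.svd(neutral_feats - mean, full_matrices=False)
    return mean, vt[:dim]

def differential_component(feat, mean, basis):
    """Residual of a moment vector outside the neutral subspace, used
    here as a stand-in for the differentially expressive component."""
    centered = feat - mean
    return centered - basis.T @ (basis @ centered)
```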

The paper is organized as follows. In Section 2, the estimation of the 2D GHMs and their invariants from discrete images is briefly reviewed. Section 3 presents the construction of the proposed features in terms of the discriminatory set of moments and the differentially expressive components of the moments using the expression-invariant subspace. This section also provides the method of classifying the proposed GHM-based features for recognizing the expressions. In Section 4, the setups of the expression databases used in the experiments are described, and the results of the proposed and existing FER methods are presented. Finally, conclusions are drawn in Section 5.

Section snippets

Face representation by 2D GHMs

Let $I(x,y)\in L^2(\mathbb{R}^2)$ be a continuous square-integrable 2D facial image. The set of 2D orthogonal image moments of order $(p,q)$, $p,q\in\mathbb{Z}_{\geq 0}$, denoted as $M_{pq}^{\Psi}$, may be obtained as [57]
$$M_{pq}^{\Psi}=\iint_{\mathbb{R}^2} I(x,y)\,\Psi_p(x)\,\Psi_q(y)\,\mathrm{d}x\,\mathrm{d}y$$
where $\Psi_p(\cdot)$ and $\Psi_q(\cdot)$ are two independent generalized sets of orthogonal polynomial functions of orders $p$ and $q$, respectively. In this paper, the moments for the FER system are obtained from the orthogonal Gaussian–Hermite polynomials. Hence, a brief review of the Hermite polynomials and their …
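The moment integral above lends itself to a direct discrete approximation on the pixel grid. The sketch below computes the 2D GHMs of an image using an orthonormal Gaussian–Hermite basis; the mapping of the image onto [-1, 1] x [-1, 1] and the Gaussian width `sigma` are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from numpy.polynomial.hermite import hermval
from math import factorial, sqrt, pi

def gh_basis(order, x, sigma):
    """Orthonormal 1D Gaussian-Hermite basis function of the given
    order: a physicists' Hermite polynomial with a Gaussian envelope."""
    coeffs = np.zeros(order + 1)
    coeffs[order] = 1.0                # select H_order in the Hermite series
    norm = 1.0 / sqrt(2.0 ** order * factorial(order) * sigma * sqrt(pi))
    return norm * np.exp(-x ** 2 / (2.0 * sigma ** 2)) * hermval(x / sigma, coeffs)

def ghm_2d(image, max_order, sigma=0.1):
    """Discrete approximation of the moments M_pq up to max_order,
    with the pixel grid mapped onto [-1, 1] x [-1, 1]."""
    h, w = image.shape
    x, y = np.linspace(-1, 1, w), np.linspace(-1, 1, h)
    dx, dy = x[1] - x[0], y[1] - y[0]
    bx = np.stack([gh_basis(p, x, sigma) for p in range(max_order + 1)])
    by = np.stack([gh_basis(q, y, sigma) for q in range(max_order + 1)])
    M = by @ image @ bx.T * dx * dy    # entry [q, p] holds the (p, q) sum
    return M.T                         # so that M[p, q] = M_pq
```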

FER method using moments

In this section, the proposed FER algorithm that considers the expression features in terms of the GHMs and RTIGHMs is presented. Let a vector that comprises the $(N+1)^2$ GHMs of an image having expression label $\ell$ $(\ell\in 1,2,\ldots,K)$ be expressed as
$$\boldsymbol{M}=[M_{00},M_{01},\ldots,M_{0N},M_{10},\ldots,M_{N0},M_{11},\ldots,M_{NN}]$$
and a vector that has $R$ RTIGHMs of the image be expressed as
$$\boldsymbol{\rho}=[\rho_1,\rho_2,\ldots,\rho_R]$$
where the first two RTIGHMs are estimated from the second- and third-order GHMs as [59]
$$\rho_1=\bar{M}_{02}+\bar{M}_{20},\qquad \rho_2=\ldots$$
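To make the vector construction above concrete, here is a minimal sketch that flattens a moment matrix into the stated ordering and evaluates the first invariant from the visible part of the snippet; the moment matrix is assumed to come from a helper such as the earlier `ghm_2d` sketch, and the expression for rho_2 onward is truncated in the excerpt, so only rho_1 is reproduced.

```python
import numpy as np

def moment_feature_vector(M):
    """Flatten an (N+1) x (N+1) moment matrix into the ordering
    [M00, M01, ..., M0N, M10, ..., MN0, M11, ..., MNN] given above."""
    N = M.shape[0] - 1
    head = [M[0, q] for q in range(N + 1)]                    # M00 ... M0N
    col = [M[p, 0] for p in range(1, N + 1)]                  # M10 ... MN0
    rest = [M[p, q] for p in range(1, N + 1) for q in range(1, N + 1)]
    return np.array(head + col + rest)

def rtighm_rho1(M_bar):
    """First invariant from the snippet: rho_1 = Mbar_02 + Mbar_20
    (the expressions for rho_2 onward are truncated above)."""
    return M_bar[0, 2] + M_bar[2, 0]
```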

Experimental results

We performed several experiments to evaluate the performance of the proposed EGHM features for classifying facial expressions. This section first highlights the characteristics of the datasets that are used to evaluate the performance of the proposed method. Next, the experimental setup, the performance evaluation on the datasets, and a comparison of the experimental results are presented.
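Given per-image feature vectors, an evaluation protocol on mutually exclusive subjects can be sketched with a grouped split and an SVM classifier as below; the kernel, regularization constant, and split ratio are placeholders, not the settings reported in this section.

```python
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(features, labels, subject_ids, seed=0):
    """Train and test an SVM so that no subject appears in both splits,
    mirroring a mutually-exclusive-subjects protocol."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=seed)
    train, test = next(splitter.split(features, labels, groups=subject_ids))
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    clf.fit(features[train], labels[train])
    return clf.score(features[test], labels[test])  # classification accuracy
```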

Conclusion

In this paper, an expression recognition algorithm has been developed by using new features in terms of the orthogonal 2D GHMs and their geometric invariants obtained from facial images. It has been shown that the discriminative moments as well as the differentially expressive components of the moments estimated from a neutral subspace can effectively model the facial activities that originate due to expressions. In the proposed method, a discriminative set of moments has been selected using a …

Conflict of interest

The authors declare that they have no competing interests.

Acknowledgement

The initial part of this work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada. The authors would like to thank the anonymous reviewers for their valuable comments, which were useful in improving the quality of the paper.

References (74)

  • B. Yang et al., Image reconstruction from continuous Gaussian–Hermite moments implemented by discrete algorithm, Pattern Recognit. (2012)
  • A. Mehrabian, Communication without words, Psychol. Today (1968)
  • M. Turk, Multimodal human-computer interaction, in: Real-Time Vision for Human-Computer Interaction, Springer, New...
  • V. Bettadapura, Facial Expression Recognition and Analysis: The State of the Art, Technical Report 1203.6722, Cornell...
  • W.M. Wundt, Grundzüge der physiologischen Psychologie, Engelman, Leipzig,...
  • P. Ekman et al., Constants across cultures in the face and emotion, J. Personal. Soc. Psychol. (1971)
  • J.N. Bassili, Facial motion in the perception of faces and of emotional expression, J. Exp. Psychol.: Hum. Percept. Perform. (1978)
  • K.J. Kelly et al., Metacognition of emotional face recognition, Emotion (2011)
  • H. Rodger et al., Mapping the development of facial expression recognition, Dev. Sci. (2015)
  • M.S. Bartlett et al., Automatic recognition of facial actions in spontaneous expressions, J. Multimed. (2006)
  • S.E. Kahou, P. Froumenty, C. Pal, Facial Expression Analysis Based on High Dimensional Binary Features, in: Lecture...
  • S. Mitra et al., Understanding the role of facial asymmetry in human face identification, Stat. Comput. (2007)
  • N. Alugupally et al., Analysis of landmarks in recognition of face expressions, Pattern Recognit. Image Anal. (2011)
  • G. Donato et al., Classifying facial actions, IEEE Trans. Pattern Anal. Mach. Intell. (1999)
  • L. Wang et al., Feature representation for facial expression recognition based on FACS and LBP, Int. J. Autom. Comput. (2014)
  • D. Ghimire et al., Recognition of facial expressions based on tracking and selection of discriminative geometric features, Int. J. Multimed. Ubiquitous Eng. (2015)
  • G. Guo et al., Facial expression recognition influenced by human aging, IEEE Trans. Affect. Comput. (2013)
  • S. Yang, B. Bhanu, Facial expression recognition using emotion avatar image, in: Proceedings of IEEE International...
  • Z. Wang, S. Wang, Q. Ji, Capturing complex spatio-temporal relations among facial muscles for facial expression...
  • Z. Wang, Y. Li, S. Wang, Q. Ji, Capturing Global Semantic Relationships for Facial Action Unit Recognition, in:...
  • E. Sariyanidi et al., Automatic analysis of facial affect: a survey of registration, representation, and recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • S. Wang et al., Video affective content analysis: a survey of state-of-the-art methods, IEEE Trans. Affect. Comput. (2015)
  • H. Li, J.-M. Morvan, L. Chen, 3D facial expression recognition based on histograms of surface differential quantities,...
  • M.J. Lyons et al., Automatic classification of single facial images, IEEE Trans. Pattern Anal. Mach. Intell. (1999)
  • A. Lanitis et al., Automatic interpretation and coding of face images using flexible models, IEEE Trans. Pattern Anal. Mach. Intell. (1997)
  • S.R.V. Kittusamy et al., Facial expressions recognition using eigenspaces, J. Comput. Sci. (2012)
  • L. Oliveira et al., 2D principal component analysis for face and facial-expression recognition, Comput. Sci. Eng. (2011)

Saif Muhammad Imran received the BSc and MSc degrees in Electrical and Electronic Engineering (EEE) from Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh, in 2012 and 2014, respectively. He is currently in the PhD program of the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, USA.

S.M. Mahbubur Rahman received the BSc and MSc degrees in Electrical and Electronic Engineering from BUET, Dhaka, Bangladesh, in 1999 and 2002, respectively, and the PhD degree from Concordia University, Montreal, QC, Canada, in 2009. In 1999, he joined BUET, where he is currently a Professor. He was an NSERC Postdoctoral Fellow at the University of Toronto in 2012.

Dimitrios Hatzinakos received the Diploma degree from the University of Thessaloniki, Greece, in 1983, the MASc degree from the University of Ottawa, Canada, in 1986, and the PhD degree from Northeastern University, Boston, MA, USA, in 1990, all in electrical engineering. In 1990, he joined the Department of Electrical and Computer Engineering, University of Toronto, where he now holds the rank of Professor and is the Director of the Identity, Privacy and Security Institute. Since 2004, he has been serving as the Bell Canada Chair in Multimedia.

1. A significant part of this work was done while the author was at BUET.

