Hierarchical uncorrelated multiview discriminant locality preserving projection for multiview facial expression recognition

https://doi.org/10.1016/j.jvcir.2018.04.013

Abstract

Existing multi-view facial expression recognition algorithms are not fully capable of finding discriminative directions when the data exhibit multi-modal characteristics. This work addresses this issue in the context of multi-view facial expression recognition. For multi-modal data, a locality preserving projection (LPP)- or local Fisher discriminant analysis (LFDA)-based approach is well suited to finding a discriminative space, and classification performance can be further enhanced by imposing an uncorrelatedness constraint on that space. Accordingly, for multi-view (multi-modal) data, we propose an uncorrelated multi-view discriminant locality preserving projection (UMvDLPP)-based approach to find an uncorrelated common discriminative space. Additionally, the proposed UMvDLPP is implemented in a hierarchical fashion (H-UMvDLPP) to obtain optimal performance. Extensive experiments on the BU3DFE dataset show that UMvDLPP performs slightly better than existing methods, while H-UMvDLPP achieves an improvement of approximately 3% over existing state-of-the-art multi-view learning-based approaches. This improvement arises because the proposed method enhances the discrimination between classes more effectively, classifying expressions category-wise and then classifying the basic expressions embedded in each subcategory (hierarchical approach).

Introduction

In the real world, images of facial expressions can be captured by cameras situated at different points of the world coordinate system. The captured expressive face images may therefore have arbitrary head poses, and the corresponding images of facial expressions may lie in completely different spaces [1]. Hence, the facial features extracted for one viewing angle may differ from those obtained for other viewing angles. This scenario imposes several challenges on facial expression recognition (FER), including face occlusion, discriminative feature extraction, face alignment, and accurate non-frontal facial point localization [2]. Like facial expression recognition, face recognition (FR) also uses features of human faces. The task of FR is accomplished by comparing features extracted from an unknown face image to a stored set of features of different individuals. However, one of the major concerns of a face recognition system is that the extracted features should be invariant to different facial expressions [3], [4]. Facial features extracted from expressive face images have somewhat different spatial characteristics than those extracted from a neutral face image. Hence, the face recognition problem becomes more challenging when faces have different facial expressions. On the other hand, facial features extracted for FER should be independent of the identity of the face. The essential components of FER include face localization, feature extraction, and classification. In this paper, a supervised framework is proposed to recognize facial expressions such as “happy”, “surprise”, “fear”, “anger”, “sad”, and “disgust” from multi-view face images. In the direction of automated FER, there is an ever-growing need for a system that can recognize expressions from arbitrary views or a set of viewing angles for applications such as sign language recognition, human-computer interaction, and many more. So far, only a few research works on multi-view/view-invariant facial expression recognition have been reported. The reported techniques can be grouped into three categories: (1) methods that perform pose-wise recognition [5], [6], [7], [8], (2) methods that perform view-normalization before recognition [9], [10], [11], [2], and (3) methods that learn a single discriminative space from the observations of multiple views [12], [13], [14], [15], [16]. The first group of methods selects view-specific classifiers for FER. Moore and Bowden [5] performed multi-view FER by learning a view-specific supervised Support Vector Machine (SVM) for each of the views [17]. Hu et al. performed multi-view FER by extracting 2D displacement vectors of 38 landmark facial points between the expressive face images and the corresponding neutral face images [6]. Subsequently, these features are fed into different view-specific classifiers. In [7], the authors investigated three kinds of appearance-based features, namely, Scale Invariant Feature Transform (SIFT) [18], Histogram of Oriented Gradients (HOG) [19], and Local Binary Patterns (LBP) [20], for multi-view (0°, 30°, 45°, 60°, and 90°) FER. Hesse et al. performed multi-view FER by extracting different appearance-based features, e.g., SIFT, HOG, and Discrete Cosine Transform (DCT) [21], around 83 landmark facial points [8]. The major shortcoming of the above-mentioned methods is that the correlations that exist across different views of the expressions are not considered at all.
Since separate view-specific classifiers are learned for FER, the overall classification strategy is sub-optimal. The second group of methods mainly follows a three-step procedure, i.e., head-pose estimation, head-pose normalization, and FER from the frontal pose. Rudovic et al. [9], [10], [11] recognize expressions from non-frontal views of facial images. For this, 39 facial points are localized on each of the non-frontal/multi-view facial images, and then head-pose normalization is performed. The objective of head-pose normalization is to learn mapping functions between a discrete set of non-frontal poses and the frontal pose. In order to learn robust mapping functions, a coupled Gaussian process regression-based framework is proposed by considering pair-wise correlations between different views. However, the learning of mapping functions is performed in the observation space, and hence improper mapping functions can adversely affect the classification accuracy. A view-normalization or multi-view facial feature synthesis method is also proposed in [2]. In this method, block-based texture features are extracted from multi-view facial images to learn mapping functions between any two different views of facial images, so the features can be extracted from several off-regions, on-regions, and on/off-regions of a face. Their method has a limitation in that the weight assignment for on/off-regions is not defined. The major shortcoming of the methods in this group is that head-pose normalization and the learning of expression classifiers are carried out independently, which may eventually affect the final classification accuracy. The third group of methods has several advantages. One important advantage is that a single classifier is learned instead of view-specific classifiers [12], [13], [14], so head-pose normalization is not needed. In [14], it was assumed that different views of a facial expression are just different manifestations of the same underlying facial expression, and hence the correlations that exist among different views of expressions are considered during the training phase. They proposed the discriminative shared Gaussian process latent variable model (DS-GPLVM) to learn a single non-linear discriminative subspace. However, the discriminative nature of the Gaussian process depends on the kind of prior. They proposed a Laplacian-based prior [22], which can give better performance than a Linear Discriminant Analysis (LDA)-based prior. The Laplacian-based prior preserves the within-class local topology of the data in the reduced space by minimizing the sum of squared distances between the latent positions of intra-class examples. However, it ignores the effect of inter-class variations of the data, which results in a sub-optimal latent space. Although DS-GPLVM can give better performance, the method proposed in [1] is a linear non-parametric projection-based approach, and hence it is comparatively simpler than the parametric DS-GPLVM-based approach. For learning a common discriminative space (latent space) shared by all the views, several research works [23], [24], [25], [26], [1], [27] have been proposed. Eleftheriadis et al. [14] showed that the above-mentioned methods can be efficiently applied to multi-view FER. Hence, inspired by the state-of-the-art multi-view learning-based method proposed in [1] and the method proposed in [28], we propose a more efficient objective function for multi-view FER.
The proposed method is termed uncorrelated multi-view discriminant locality preserving projection (UMvDLPP) analysis. The proposed objective function of UMvDLPP is formulated in such a way that it preserves the intra-class topology of both intra-view and inter-view samples in the common space, while maximizing the local between-class separation of intra-view and inter-view samples. The motivation behind generalizing LPP and the local between-class scatter matrix (LBCSM) in our proposed method is that they are both capable of handling the multi-modal characteristics of multi-view facial data [29]. In contrast, a simple LDA-based approach fails to capture discriminative directions when the data of different classes have several local maxima and minima [29]. So, our approach entails extracting an uncorrelated common discriminative space. Fig. 1 shows the multi-modal characteristics of multi-view happy and surprise expressions. To handle multi-modal data, in our proposed UMvDLPP approach we adopt the LBCSM defined in [30] to maximize the local between-class separability of the data, and a Laplacian-based approach to minimize the within-class local scatter while preserving the local geometric structure of the data. Next, the proposed UMvDLPP-based method is extended in a hierarchical manner, termed Hierarchical-UMvDLPP (H-UMvDLPP), in order to obtain the optimal performance for multi-view FER. In the first step, the basic expressions are grouped into three categories, namely, Lips-based, Lips-Eyes-based, and Lips-Eyes-Forehead-based. The first stage of classification is done on the basis of the contribution of a specific part/region of the face to a particular expression. After obtaining the most likely expression sub-category, the constituent expressions are identified in the second stage of classification. This two-stage classification strategy is shown in Table 1. The framework of our proposed H-UMvDLPP-based multi-view facial expression recognition approach is shown in Fig. 2, and it mainly has the following four steps.

  • 1.

    Feature extraction: The performance of any FER system mainly depends on the availability of the most discriminative features. Apparently, informative/active regions of a face can provide the most discriminative features [31], [32], [33], [34], [35]. In [36], we proposed an efficient face model based on extracting informative regions of a face. Our model consists of 54 facial points, and features are extracted only from the salient/informative regions of the face. It was shown in our earlier work [36] that this face model outperforms several existing facial models [9], [37], [38]. So, we employ our informative region-based face model [36] to extract features from multi-view facial images.

  • 2.

    Hierarchical-UMvDLPP: In our approach, expressions are initially divided into three categories on the basis of the movements of the lips, eyes, and forehead, as stated in [39], [40]. The corresponding common space is learned using 1-UMvDLPP as shown in Fig. 2(a). Further, a 2-UMvDLPP is learned for the constituent expressions present in each of the subcategories. Hence, there are three different 2-UMvDLPPs, one for each sub-category.

  • 3.

    Sub-model selection: This step is required in the second/final level of facial expression classification. As there are three different trained subclass-specific 2-UMvDLPPs, the system needs to select a specific 2-UMvDLPP for a given test sample. In the proposed framework, this selection is made automatically based on the class label assigned to the test sample at the first level of H-UMvDLPP, i.e., by 1-UMvDLPP + kNN. Hence, the model selection step of the proposed H-UMvDLPP model for the MvFER system is fully automatic.

  • 4.

    Multi-view FER: For recognition, the facial features are first extracted, and then the first level of classification is performed using 1-UMvDLPP and kNN. The first level of classification is basically a three-class problem, and hence the output of the classifier is the class label of the specific sub-category to which the test sample belongs. Finally, the test sample is classified by the corresponding expression-specific 2-UMvDLPP (a sketch of this two-stage inference is given after this list).
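To make the flow of this two-stage inference concrete, the following is a minimal sketch, assuming the first- and second-stage projection matrices and the labelled training embeddings have already been learned. The function and variable names (knn_predict, h_umvdlpp_predict, stage1, stage2, etc.) are illustrative placeholders, not the exact implementation used in the paper.

```python
import numpy as np
from collections import Counter

def knn_predict(z, train_Z, train_y, k=5):
    """Plain kNN in the learned common space (train_y is a NumPy array)."""
    d = np.linalg.norm(train_Z - z, axis=1)
    nearest = train_y[np.argsort(d)[:k]]
    return Counter(nearest).most_common(1)[0][0]

def h_umvdlpp_predict(x, view, stage1, stage2):
    """Two-stage hierarchical prediction for one test sample x from a given view.

    stage1: dict holding the view-specific 1-UMvDLPP projection matrices 'W'
            and the embedded training data ('Z', 'y') labelled by sub-category.
    stage2: dict mapping each sub-category label to an analogous model whose
            training labels are the constituent basic expressions.
    """
    # Stage 1: project into the common space and pick the sub-category.
    z1 = stage1['W'][view].T @ x
    subcategory = knn_predict(z1, stage1['Z'], stage1['y'])

    # Stage 2: use the 2-UMvDLPP trained for that sub-category only.
    model = stage2[subcategory]
    z2 = model['W'][view].T @ x
    expression = knn_predict(z2, model['Z'], model['y'])
    return subcategory, expression
```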

In our earlier work (UMvDLPP) [41], local Fisher discriminant analysis (LFDA) is employed instead of linear discriminant analysis (LDA) to handle the multi-modal characteristics of multi-view data. An objective function based on UDLPP and LFDA is proposed which minimizes the distances between intra-class samples of both intra-view and inter-view data, thereby preserving their local topology, and maximizes the local between-class scatter for both intra-view and inter-view samples. The performance of UMvDLPP is slightly better than that of MvDA. In order to further improve the accuracy of UMvDLPP, our earlier method is extended in this paper into a hierarchical framework, termed hierarchical-UMvDLPP (H-UMvDLPP). As explained above, the proposed method consists of two stages, each implemented by UMvDLPP. In H-UMvDLPP, expressions are first sub-categorized based on the facial sub-regions involved in the expressions, and hence in the second stage the classifiers only need to distinguish the individual expressions belonging to a particular subcategory. This hierarchical approach reduces the search domain in the final classification stage and marginalizes the impact of irrelevant features compared to a single-stage implementation (UMvDLPP). Extensive experiments are carried out in order to validate the proposed H-UMvDLPP, including implementations of several linear (MvDA, GMA) and non-linear (D-GPLVM, DS-GPLVM) state-of-the-art methods. The performance of H-UMvDLPP is also compared with a convolutional neural network (CNN), a deep belief network (DBN), and a DNN-based structure. In the current paper, we also show the distribution of samples belonging to different expression classes at different stages, which illustrates that the proposed method effectively converts a six-class expression recognition problem into a three-class recognition problem.
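As with most discriminant projection methods of this family, once the local between-class and within-class (Laplacian-style) scatter matrices have been assembled over all views, projection directions can be obtained from a generalized eigenvalue problem. The snippet below is only a generic sketch of that final step under this assumption; it is not the exact optimization (including the uncorrelatedness constraint) derived in the paper, and the function name and regularizer are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def discriminant_directions(S_lb, S_lw, dim, reg=1e-6):
    """Return `dim` projection directions maximizing local between-class
    scatter S_lb relative to local within-class scatter S_lw.

    Solves the generalized symmetric eigenproblem S_lb w = lambda S_lw w
    and keeps the eigenvectors with the largest eigenvalues. A small
    ridge term keeps S_lw well conditioned.
    """
    S_lw = S_lw + reg * np.eye(S_lw.shape[0])
    eigvals, eigvecs = eigh(S_lb, S_lw)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]       # take the largest ones
    return eigvecs[:, order]
```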

The rest of the paper is organized as follows. In Section 2, we introduce our proposed UMvDLPP-based method for obtaining an uncorrelated common discriminative space from multiple observation spaces. Experimental results are presented in Section 3. Finally, conclusions are drawn in Section 4.

Section snippets

Proposed method

Our proposed UMvDLPP-based method [41] generalizes the Uncorrelated Discriminative LPP (UDLPP) [28] along with Local Fisher Discriminant Analysis (LFDA) [29] to learn a robust uncorrelated discriminative common space from the observations of multi-view facial images. Let $X = [X^1, X^2, \ldots, X^v] \in \mathbb{R}^{D \times vN}$ be the $D$-dimensional data space extracted from $v$ views of facial expressions. The $k$th view of $X$, i.e., $X^k$, is given by $X^k = \{x_i^{ck} \mid i = 1, 2, \ldots, N;\ c = 1, 2, \ldots, C\}$, where $x_i^{ck}$ denotes the $i$th sample of the $c$th class extracted from the $k$th view.
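For reference, the single-view local scatter matrices of LFDA [29], which the proposed objective generalizes to the multi-view setting, are commonly written as follows. This is the standard LFDA formulation, restated here only for context; it is not the paper's multi-view objective. Here $n$ samples $x_i$ have class labels $y_i$, $n_c$ is the size of class $c$, and $A_{ij}$ is an affinity between nearby samples:

\[
\tilde{S}^{(w)} = \tfrac{1}{2}\sum_{i,j=1}^{n} \tilde{W}^{(w)}_{ij}\,(x_i - x_j)(x_i - x_j)^{\top}, \qquad
\tilde{S}^{(b)} = \tfrac{1}{2}\sum_{i,j=1}^{n} \tilde{W}^{(b)}_{ij}\,(x_i - x_j)(x_i - x_j)^{\top},
\]
\[
\tilde{W}^{(w)}_{ij} =
\begin{cases}
A_{ij}/n_c & \text{if } y_i = y_j = c,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\tilde{W}^{(b)}_{ij} =
\begin{cases}
A_{ij}\left(\tfrac{1}{n} - \tfrac{1}{n_c}\right) & \text{if } y_i = y_j = c,\\
\tfrac{1}{n} & \text{otherwise.}
\end{cases}
\]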

Experiments

The BU-3D Facial Expression (BU3DFE) dataset [42] is used to validate our proposed H-UMvDLPP-based method. This dataset comprises 3D facial images of seven basic expressions, i.e., happy (HA), surprise (SU), fear (FE), anger (AN), disgust (DI), sad (SA), and neutral (NA). Expressions in the BU3DFE dataset are captured at four different intensity levels, ranging from the onset/offset level to the peak level of the expressions. For our experimentation, 2D facial images corresponding to seven

Conclusions

In this paper, we addressed the problem of recognizing facial expressions from multi-view face images. To this end, we developed a novel linear non-parametric approach, termed UMvDLPP. The main objective is to learn an uncorrelated discriminative common space from multiple observations of multi-view facial images. UMvDLPP is able to capture discriminative directions robustly even if the data exhibit multi-modal characteristics. We proposed a novel objective function for

References (54)

  • S. Moore et al.

    Local binary patterns for multi-view facial expression recognition

    Comput. Vis. Image Underst.

    (2011)
  • Z. Zheng et al.

    Gabor feature-based face recognition using supervised locality preserving projection

    Signal Process.

    (2007)
  • M. Kan et al.

    Multi-view discriminant analysis

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • W. Zheng

    Multi-view facial expression recognition based on group sparse reduced-rank regression

    IEEE Trans. Affect. Comput.

    (2014)
  • C.-K. Hsieh et al.

    Expression-invariant face recognition with constrained optical flow warping

    IEEE Trans. Multimedia

    (2009)
  • W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, Sphereface: deep hypersphere embedding for face recognition, in: The...
  • Y. Hu, Z. Zeng, L. Yin, X. Wei, J. Tu, T. S. Huang, A study of non-frontal-view facial expressions recognition, in:...
  • Y. Hu, Z. Zeng, L. Yin, X. Wei, X. Zhou, T. S. Huang, Multi-view facial expression recognition, in: 8th IEEE...
  • N. Hesse, T. Gehrig, H. Gao, H. K. Ekenel, Multi-view facial expression recognition using local appearance features,...
  • O. Rudovic et al.

    Coupled Gaussian processes for pose-invariant facial expression recognition

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • O. Rudovic, I. Patras, M. Pantic, Coupled Gaussian process regression for pose-invariant facial expression recognition,...
  • O. Rudovic, I. Patras, M. Pantic, Regression-based multi-view facial expression recognition, in: 2010 20th...
  • W. Zheng, H. Tang, Z. Lin, T. S. Huang, Emotion recognition from arbitrary view facial images, in: Computer Vision–ECCV...
  • U. Tariq, J. Yang, T. S. Huang, Multi-view facial expression recognition analysis with generic sparse coding feature,...
  • S. Eleftheriadis et al.

    Discriminative shared Gaussian processes for multiview and view-invariant facial expression recognition

    IEEE Trans. Image Process.

    (2015)
  • J. Lu et al.

    Discriminative multimanifold analysis for face recognition from a single training sample per person

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • J. Lu et al.

    Localized multifeature metric learning for image-set-based face recognition

    IEEE Trans. Circuits Syst. Video Technol.

    (2016)
  • C. Cortes et al.

    Support-vector networks

    Mach. Learn.

    (1995)
  • D.G. Lowe

    Distinctive image features from scale-invariant keypoints

    Int. J. Comput. Vis.

    (2004)
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference on...
  • T. Ojala et al.

    Multiresolution gray-scale and rotation invariant texture classification with local binary patterns

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2002)
  • N. Ahmed et al.

    Discrete cosine transform

    IEEE Trans. Comput.

    (1974)
  • F.R. Chung, Spectral Graph Theory, vol. 92, American Mathematical Soc.,...
  • H. Hotelling

    Relations between two sets of variates

    Biometrika

    (1936)
  • A.A. Nielsen

    Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data

    IEEE Trans. Image Process.

    (2002)
  • J. Rupnik, J. Shawe-Taylor, Multi-view canonical correlation analysis, in: Conference on Data Mining and Data...
  • A. Sharma, A. Kumar, H. Daume, D. W. Jacobs, Generalized multiview analysis: a discriminative latent space, in: 2012...
