Elsevier

Image and Vision Computing

Volume 47, March 2016, Pages 19-26
M3 CSR: Multi-view, multi-scale and multi-component cascade shape regression

https://doi.org/10.1016/j.imavis.2015.11.005

Highlights

  • We investigate how face detection affects face alignment.

  • We improve the CSR model by multi-view, multi-scale and multi-component strategies.

  • We obtain impressive results on the IBUG and 300-W challenge datasets.

Abstract

Automatic face alignment is a fundamental step in facial image analysis. However, the problem remains challenging because of the large variability in expression, illumination, occlusion and pose, as well as detection drift, in real-world face images. In this paper, we present a multi-view, multi-scale and multi-component cascade shape regression (M3CSR) model for robust face alignment. First, the face view is estimated from the deformable facial parts in order to learn view-specific CSR models, which decreases the shape variance, alleviates the drift of face detection and accelerates shape convergence. Second, multi-scale HOG features are used as the shape-indexed features to incorporate local structure information implicitly, and a multi-scale optimization strategy is adopted to avoid getting trapped in local optima. Finally, a component-based shape refinement process is developed to further improve alignment accuracy. Extensive experiments on the IBUG dataset and the 300-W challenge dataset demonstrate the superiority of the proposed method over state-of-the-art methods.

Introduction

Automatic facial landmark localization plays an important role in facial image analysis [1], [2], [3], [4]. Many methods [5], [6], [7], [8], [9], [10] have been proposed over the past two decades, achieving remarkable improvements [11], [12] on standard benchmarks. Existing methods can be roughly divided into three categories: generative methods, discriminative methods and statistical methods [13]. Generative methods optimize the shape parameter configuration by maximizing the probability that a face image is reconstructed by a deformable facial model. The Active Shape Model (ASM) [14] and the Active Appearance Model (AAM) [15], [16], [17], [18] are two representative generative methods. Discriminative methods infer a face shape through a discriminative regression function that maps a face image directly to the landmark coordinates [19], [20], [21], [22], [23], [24]. There are two popular ways to learn such a regression function. One is based on deep neural network learning [25], [26], [27], [28]. The other is the well-known cascade shape regression model, which learns a set of regressors to approximate the complex nonlinear mapping between the initial shape and the ground truth [29], [30], [31]. Statistical methods combine generative and discriminative ideas: they first learn patch experts and then fit the shape model statistically. The most notable example is probably the Constrained Local Model (CLM) [32], [33], [34] paradigm, which represents the face via a set of local image patches cropped around the landmark points. Recent research efforts have focused on the collection, annotation and alignment of face images captured in the wild [12]. However, face alignment is still challenging due to the large variability of expression, illumination, occlusion and pose in real-world face images [11].

An automatic face alignment system also depends on the performance of the face detector, because its initialization is usually based on the detector's output. Challenging factors such as pose, illumination, expression and occlusion also strongly affect the performance of face detection [35]. Moreover, a detection is usually counted as correct by the criterion [36] that the overlap ratio between the detected region and the annotated face region is greater than 0.5. As shown in Fig. 1, all of the faces are detected according to the 0.5 overlap criterion, but the detection results drift to a greater or lesser degree. When a detection result drifts far from the ground truth, it is not accurate enough to initialize a face landmark localization algorithm.
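The overlap criterion can be made concrete with a short sketch. It assumes the ratio is computed as intersection over union, as in PASCAL VOC style detection evaluation; the text above only says "ratio of the intersection", so this reading, the `(x1, y1, x2, y2)` box layout and the function names are assumptions for illustration.

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).

    Assumption: the overlap criterion is read as intersection over union.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_detected(detected_box, annotated_box, threshold=0.5):
    """A detection counts as correct when the overlap exceeds the threshold."""
    return overlap_ratio(detected_box, annotated_box) > threshold


annotated = (0, 0, 10, 10)
shifted = (2, 0, 12, 10)   # same size, shifted by 20% of the face width
print(overlap_ratio(annotated, shifted), is_detected(shifted, annotated))
```

In this toy case the shifted box still passes the 0.5 criterion (80 / 120, about 0.67) even though it is displaced by a fifth of the face width, which is the kind of drift that can degrade shape initialization.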

In this paper, we propose a robust face landmark localization algorithm. The proposed method is based on the popular cascade shape regression model, and we improve its robustness in three respects. First, we develop a robust face detector based on the deformable parts model (DPM) [37], [35] to provide a good shape initialization for face alignment. We also use the deformable parts information to predict the face view, so as to select the view-specific shape model. A view-based shape model not only decreases the shape variance but also accelerates shape convergence. Second, we develop a multi-scale cascade shape regression with multi-scale HOG features [38]. Multi-scale HOG features incorporate local structure information implicitly, and multi-scale cascade shape regression helps to avoid getting trapped in local optima. Finally, to further improve the performance of face alignment, a refinement process is conducted on facial components such as the mouth. The proposed method achieves state-of-the-art performance on challenging benchmarks, including the IBUG dataset and the 300-W dataset.
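As a rough illustration of multi-scale, shape-indexed features, the sketch below computes a simplified gradient orientation histogram in square patches of several sizes around each landmark and concatenates them. It is not the paper's HOG implementation: the single-cell histogram without block normalization, the patch half-sizes, the bin count and the function names are all assumptions chosen to keep the example short and NumPy-only.

```python
import numpy as np


def orientation_histogram(patch, n_bins=9):
    """Simplified HOG-like descriptor: one magnitude-weighted histogram of
    unsigned gradient orientations for the whole patch (no cells or blocks)."""
    gy, gx = np.gradient(patch.astype(float))
    magnitude = np.hypot(gx, gy)
    angle = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned, in [0, pi)
    bins = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(),
                       minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-8)        # L2 normalization


def multiscale_landmark_features(image, landmarks, half_sizes=(4, 8, 16),
                                 n_bins=9):
    """Concatenate descriptors from patches of several sizes per landmark.

    `landmarks` is an iterable of (x, y) positions inside the image;
    `half_sizes` (hypothetical values) sets the patch scales.
    """
    pad = max(half_sizes)
    padded = np.pad(image, pad, mode="edge")           # keep patches in bounds
    feats = []
    for x, y in landmarks:
        cx, cy = int(round(x)) + pad, int(round(y)) + pad
        for h in half_sizes:
            patch = padded[cy - h:cy + h, cx - h:cx + h]
            feats.append(orientation_histogram(patch, n_bins))
    return np.concatenate(feats)


image = np.random.default_rng(0).random((64, 64))
landmarks = [(20, 30), (40, 30), (32, 45)]
print(multiscale_landmark_features(image, landmarks).shape)
```

Because the patches are re-sampled at the current landmark estimates, the feature vector changes as the shape moves, which is what makes such features shape-indexed. A library HOG implementation with cells and block normalization could replace `orientation_histogram` without changing the overall structure.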

The rest of the paper is organized as follows. Related work is reviewed in Section 2. The cascade shape regression model with multi-view, multi-scale and multi-component strategies is presented in Section 3. Experimental results are reported in Section 4, and conclusions are drawn in Section 5.

Related work

The cascade shape regression (CSR) model has attracted much attention in recent years because of its success in face alignment in uncontrolled environments [13]. Cascade Pose Regression (CPR) was first proposed in [29] to estimate object pose with pose-indexed features: at each iteration, it estimates a pose update from features computed at the current pose. Explicit Shape Regression (ESR) [30] improves CPR by using a two-level boosted regression and correlation-based feature selection. The
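The iterative update behind cascade regression can be sketched as follows. This is a generic illustration, not the CPR, ESR or M3CSR implementation: it assumes ridge-regularized linear regressors at every stage, shapes flattened into vectors, and a caller-supplied, hypothetical `features_fn(image, shape)` that returns shape-indexed features.

```python
import numpy as np


def fit_cascade(features_fn, images, init_shapes, true_shapes,
                n_stages=3, ridge=1e-3):
    """Train a cascade of linear regressors (an assumed regressor type).

    Each stage maps features extracted at the current shapes to the
    remaining residual (true shape minus current shape).
    """
    shapes = init_shapes.copy()
    stages = []
    for _ in range(n_stages):
        phi = np.stack([features_fn(img, s) for img, s in zip(images, shapes)])
        phi = np.hstack([phi, np.ones((len(phi), 1))])      # bias column
        target = true_shapes - shapes
        # Closed-form ridge regression: (phi^T phi + ridge I) W = phi^T target
        gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
        weights = np.linalg.solve(gram, phi.T @ target)
        stages.append(weights)
        shapes = shapes + phi @ weights                     # update training shapes
    return stages


def apply_cascade(features_fn, image, init_shape, stages):
    """Refine an initial shape by applying each stage's update in turn."""
    shape = init_shape.copy()
    for weights in stages:
        phi = np.append(features_fn(image, shape), 1.0)
        shape = shape + phi @ weights
    return shape
```

The key property is that features are recomputed at the current shape estimate before every stage, so later regressors see progressively better-aligned features. The choice of feature function and regressor is what distinguishes individual methods in this family.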

M3CSR model

Although the cascade shape regression model has achieved much success in face alignment [31], it is still sensitive to large variations in illumination, pose, expression and occlusion, which often occur in real-world images, as well as to the shape initialization provided by the face detector [38]. In this paper, we propose a new M3CSR model to make CSR more robust to these real-world variations. Its workflow is illustrated in Fig. 2, in which we enrich the system in three steps. The first step is to

Experimental data and setting

A number of face datasets [9], [47], [7] with different facial expression, pose, illumination and occlusion variations have been collected for evaluating face alignment algorithms. In [12], several in-the-wild datasets, including AFW [7], LFPW [9] and HELEN [47], are re-annotated using a semi-supervised methodology [48] and the well-established landmark configuration of Multi-PIE [49]. A new dataset called IBUG is also created by [12],

Conclusion

In this paper, we present an M3CSR model for robust face alignment. First, we develop a robust DPM-based face detector and estimate the face view from the locations of the deformable facial parts in order to select the view-specific CSR model. Second, we use multi-scale HOG features for CSR. Finally, a facial component refinement process is conducted to obtain more accurate results on the facial components. Extensive experiments on the IBUG dataset and the 300-W challenge dataset

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61532009 and Grant 61272223, and in part by the Graduate Education Innovation Project of Jiangsu under Grant KYLX15_0881.

References (51)

  • T.F. Cootes et al.

    Active shape models: their training and application

    Comput. Vis. Image Underst.

    (1995)
  • J. Zhang et al.

    Coarse-to-Fine Auto-Encoder Networks (CFAN) for real-time face alignment

  • R. Gross et al.

    Multi-PIE

    Image Vis. Comput.

    (2010)
  • A. Yao et al.

    Robust face representation using hybrid spatial feature interdependence matrix

    IEEE Trans. Image Process.

    (2013)
  • J. Qian et al.

    Local structure-based image decomposition for feature extraction with applications to face recognition

    IEEE Trans. Image Process.

    (2013)
  • R. Ramirez et al.

    Local directional number pattern for face analysis: face and expression recognition

    IEEE Trans. Image Process.

    (2013)
  • Y. Li et al.

    Simultaneous facial feature tracking and facial expression recognition

    IEEE Trans. Image Process.

    (2013)
  • L. Liang et al.

    Accurate face alignment using shape constrained Markov network

  • M. Valstar et al.

    Facial point detection using boosted regression and graph models

  • X. Zhu et al.

    Face detection, pose estimation, and landmark localization in the wild

  • F. Zhou et al.

    Exemplar-based Graph Matching for Robust Facial Landmark Localization

    (2013)
  • P.N. Belhumeur et al.

    Localizing parts of faces using a consensus of exemplars

  • J. Saragih

    Principal regression analysis

  • O. Çeliktutan et al.

    A comparative study of face landmarking techniques

    EURASIP J. Image Video Process.

    (2013)
  • C. Sagonas et al.

    300 Faces in-the-wild Challenge: the First Facial Landmark Localization Challenge

    (2013)
  • A. Asthana et al.

    Incremental face alignment in the wild

  • T.F. Cootes et al.

    Active appearance models

  • I. Matthews et al.

    Active appearance models revisited

    Int. J. Comput. Vis.

    (2004)
  • J. Alabort et al.

    Bayesian active appearance models

  • G. Tzimiropoulos et al.

    Optimization problems for fast AAM fitting in-the-wild

  • T.F. Cootes et al.

    Robust and accurate shape model fitting using random forest regression voting

  • H. Yang et al.

    Sieving regression forest votes for facial feature detection in the wild

  • B. Martinez et al.

    Local evidence aggregation for regression-based facial point detection

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2013)
  • G. Tzimiropoulos et al.

    Gauss–Newton deformable part models for face alignment in-the-wild

  • V. Kazemi et al.

    One millisecond face alignment with an ensemble of regression trees

    This paper has been recommended for acceptance by Stefanos Zafeiriou.