M3 CSR: Multi-view, multi-scale and multi-component cascade shape regression☆
Introduction
Automatic facial landmark localization plays an important role in facial image analysis [1], [2], [3], [4]. Many methods [5], [6], [7], [8], [9], [10] have been proposed, achieving remarkable improvements [11], [12] on standard benchmarks over the past two decades. Existing methods can be roughly divided into three categories: generative methods, discriminative methods and statistical methods [13]. Generative methods attempt to optimize the shape parameter configuration by maximizing the probability that a face image is reconstructed by a deformable facial model. The Active Shape Model (ASM) [14] and the Active Appearance Model (AAM) [15], [16], [17], [18] are two representative generative methods. Discriminative methods try to infer a face shape through a discriminative regression function that directly maps a face image to the landmark coordinates [19], [20], [21], [22], [23], [24]. There are two popular ways to learn such a regression function. One is based on deep neural networks [25], [26], [27], [28]; the other is the well-known cascade shape regression model, which learns a set of regressors to approximate the complex nonlinear mapping between the initial shape and the ground truth [29], [30], [31]. Statistical methods combine generative and discriminative methods: they first learn patch experts and then fit the shape model in a statistical way. The most notable example is probably the Constrained Local Model (CLM) [32], [33], [34] paradigm, which represents the face via a set of local image patches cropped around the landmark points. Recent research efforts have addressed the collection, annotation and alignment of face images captured in the wild [12]. However, face alignment remains challenging due to the large variability of expression, illumination, occlusion and pose in real-world face images [11].
An automatic face alignment system also depends on the performance of the face detector, because its initialization is usually based on the detector's output. Challenging factors such as pose, illumination, expression and occlusion also strongly affect the performance of face detection [35]. Moreover, a detection is usually judged by the criterion [36] that the ratio of the intersection of a detected region with an annotated face region is greater than 0.5. As shown in Fig. 1, all of the faces are detected according to the 0.5 overlap criterion, but the detection results drift to varying degrees. When a detection result drifts far from the ground truth, it is not accurate enough to initialize a face landmark localization algorithm.
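To make the overlap criterion concrete, the following minimal sketch assumes that the ratio in [36] is the intersection-over-union of two axis-aligned boxes. The function names, the box format (x_min, y_min, x_max, y_max) and the example coordinates are illustrative assumptions, not taken from the paper.

```python
def intersection_over_union(box_a, box_b):
    """Overlap ratio of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_detected(detected, annotated, threshold=0.5):
    """Assumed acceptance rule: overlap strictly greater than the threshold."""
    return intersection_over_union(detected, annotated) > threshold


# Hypothetical 100 x 100 pixel face and a detection shifted by (20, 10) pixels.
annotated = (100, 100, 200, 200)
drifted = (120, 110, 220, 210)
overlap = intersection_over_union(drifted, annotated)
accepted = is_detected(drifted, annotated)
print(overlap, accepted)
```

In this hypothetical case the detection is accepted under the 0.5 criterion even though it is shifted by a fifth of the face width, which is the kind of drift that can make a detection a poor shape initialization.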
In this paper, we propose a robust face landmark localization algorithm. The proposed method is based on the popular cascade shape regression model, and we improve its robustness in three respects. First, we develop a robust face detector based on the deformable parts model (DPM) [37], [35] to provide a good shape initialization for face alignment. We also use the deformable parts information to predict the face view, so as to select the view-specific shape model. A view-based shape model not only decreases the shape variance but also accelerates shape convergence. Second, we develop a multi-scale cascade shape regression with multi-scale HOG features [38]. Multi-scale HOG features implicitly incorporate local structure information, and multi-scale cascade shape regression helps to avoid becoming trapped in local optima. Third, to further improve the performance of face alignment, a refinement process is conducted on facial components, such as the mouth. The proposed method achieves state-of-the-art performance on challenging benchmarks including the IBUG dataset and the 300-W dataset.
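The cascade idea underlying the proposed method, in which each stage regresses a shape update from features computed at the current estimate, can be illustrated with a toy sketch. This is not the M3CSR implementation: the landmarks are one-dimensional, the hypothetical toy_feature function is a saturating stand-in for image descriptors such as HOG, and each stage is a single least-squares gain. All names and parameters are assumptions made for illustration.

```python
import math
import random


def toy_feature(true_pos, current_pos, scale=10.0):
    """Hypothetical stand-in for a shape-indexed image feature.

    A real system would compute this from pixels around the current landmark
    estimate; here a saturating function of the offset plays that role so the
    example stays self-contained.
    """
    return math.tanh((true_pos - current_pos) / scale)


def rmse(targets, estimates):
    """Root-mean-square error between target and estimated positions."""
    total = sum((t - e) ** 2 for t, e in zip(targets, estimates))
    return math.sqrt(total / len(targets))


def train_cascade(targets, initial, num_stages=5):
    """Fit one scalar least-squares regressor per stage on 1-D toy landmarks."""
    current = list(initial)
    gains = []
    errors = [rmse(targets, current)]
    for _ in range(num_stages):
        # Features are re-extracted at the current estimate in every stage.
        feats = [toy_feature(t, c) for t, c in zip(targets, current)]
        residuals = [t - c for t, c in zip(targets, current)]
        denom = sum(f * f for f in feats)
        # Closed-form least-squares gain mapping feature to residual.
        gain = sum(f * r for f, r in zip(feats, residuals)) / denom if denom > 0 else 0.0
        # Additive update of every training shape by the stage prediction.
        current = [c + gain * f for c, f in zip(current, feats)]
        gains.append(gain)
        errors.append(rmse(targets, current))
    return gains, errors


random.seed(0)
targets = [random.uniform(0.0, 100.0) for _ in range(200)]
initial = [t + random.uniform(-30.0, 30.0) for t in targets]
gains, errors = train_cascade(targets, initial)
for stage, err in enumerate(errors):
    print(f"stage {stage}: training RMSE = {err:.3f}")
```

Because each stage is fit by least squares on the current residuals, the training error in this sketch cannot increase from one stage to the next, and no single linear stage has to invert the nonlinear feature on its own.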
The rest of the paper is organized as follows. Related work is reviewed in Section 2. The cascade shape regression model with multi-view, multi-scale and multi-component extensions is presented in Section 3. Experimental results are shown in Section 4, and the conclusion is drawn in Section 5.
Section snippets
Related work
The cascade shape regression (CSR) model has attracted much attention in recent years because of its success in face alignment in uncontrolled environments [13]. In [29], Cascade Pose Regression (CPR) was first proposed to estimate pose with pose-indexed features; it iteratively estimates an update to the object pose from features computed at the current pose. Explicit Shape Regression (ESR) [30] improves CPR by using a two-level boosted regression and correlation-based feature selection. The
M3CSR model
Although the cascade shape regression model has achieved much success in face alignment [31], it is still sensitive to large variations in illumination, pose, expression and occlusion, which often occur in real-world images, as well as to the shape initialization provided by the face detector [38]. In this paper, we propose a new M3CSR model to make CSR more robust to these real-world variations. Its workflow is illustrated in Fig. 2, in which we enrich the system in three steps. The first step is to
Experimental data and setting
A number of face datasets [9], [47], [7] with different facial expression, pose, illumination and occlusion variations have been collected for evaluating face alignment algorithms. In [12], some in-the-wild datasets including AFW [7], LFPW [9], and HELEN [47] are re-annotated using a semi-supervised methodology [48] and the well-established landmark configuration of Multi-PIE [49]. A new dataset called IBUG is also created by [12],
Conclusion
In this paper, we present an M3CSR model for robust face alignment. First, we develop a robust DPM-based face detector and estimate the face view from the locations of the deformable facial parts in order to select the view-based CSR model. Second, we use multi-scale HOG features for CSR. Finally, a facial component refinement process is conducted to obtain more accurate results on the facial components. Extensive experiments on the IBUG dataset and the 300-W challenge dataset
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 61532009 and Grant 61272223, and in part by the Graduate Education Innovation Project of Jiangsu under Grant KYLX15_0881.
References (51)
- Active shape models: their training and application. Comput. Vis. Image Underst. (1995)
- Coarse-to-Fine Auto-Encoder Networks (CFAN) for real-time face alignment
- Multi-pie. Image Vis. Comput. (2010)
- Robust face representation using hybrid spatial feature interdependence matrix. IEEE Trans. Image Process. (2013)
- Local structure-based image decomposition for feature extraction with applications to face recognition. IEEE Trans. Image Process. (2013)
- Local directional number pattern for face analysis: face and expression recognition. IEEE Trans. Image Process. (2013)
- Simultaneous facial feature tracking and facial expression recognition. IEEE Trans. Image Process. (2013)
- Accurate face alignment using shape constrained Markov network
- Facial point detection using boosted regression and graph models
- Face detection, pose estimation, and landmark localization in the wild
- Exemplar-based Graph Matching for Robust Facial Landmark Localization
- Localizing parts of faces using a consensus of exemplars
- Principal regression analysis
- A comparative study of face landmarking techniques. EURASIP J. Image Video Process.
- 300 Faces in-the-wild Challenge: the First Facial Landmark Localization Challenge
- Incremental face alignment in the wild
- Active appearance models
- Active appearance models revisited. Int. J. Comput. Vis.
- Bayesian active appearance models
- Optimization problems for fast AAM fitting in-the-wild
- Robust and accurate shape model fitting using random forest regression voting
- Sieving regression forest votes for facial feature detection in the wild
- Local evidence aggregation for regression-based facial point detection. IEEE Trans. Pattern Anal. Mach. Intell.
- Gauss–Newton deformable part models for face alignment in-the-wild
- One millisecond face alignment with an ensemble of regression trees
☆ This paper has been recommended for acceptance by Stefanos Zafeiriou.