Adaptive Cascade Deep Convolutional Neural Networks for face alignment
Introduction
Face alignment, or facial landmark localization, plays a critical role in many visual applications such as face recognition, face tracking, facial expression recognition and 3D face modeling, and it has therefore been studied extensively in recent years. However, robust facial landmark detection remains challenging when face images are captured under extreme occlusion, lighting, expression or pose. To address this issue, researchers have explored ways of modeling shape variation and appearance variation. In general, this work falls into three groups: constrained local model based methods [2], [3], [4], active appearance model based methods [5], [6] and regression based methods [1], [7], [8], [9], [10].
Constrained local models build classifiers, called component detectors, that search for each facial feature point independently. These component detectors compute response maps that represent the appearance variation around facial feature points. Because local features are ambiguous and easily corrupted, the points detected by these local experts may lie far from the ground truth positions, so shape constraints are then applied to adjust the initial positions [2], [4]. However, global contextual information is difficult to embed into these methods.
Instead of modeling the appearance around each facial point separately, active appearance model based methods such as the Active Appearance Model (AAM) [5] model the appearance variation holistically. An AAM is composed of a linear shape model and a linear texture model, and Principal Component Analysis (PCA) is applied to relate the two models. Nevertheless, simple linear models can hardly represent the nonlinear variations in facial appearance that arise when faces are captured in complex environments (e.g., extreme lighting).
Regression based methods, on the other hand, directly learn a regression function from image appearance (features) to the target output (shapes) [11]. A cascade architecture is usually employed in these models. In each stage of the cascade, shape-indexed features [12] are extracted to predict the shape increment with linear regression [7] or tree-based regression [8], with the mean shape used as the initial shape. Coarse-to-Fine Auto-Encoder Networks (CFAN) [9] use a Stacked Auto-encoder Network [13] to predict the face shape quickly by taking the whole face as input. DCNN [1] employs a deep CNN model to extract high-level features and make an accurate initial prediction. After this initialization, DCNN uses two further levels of convolutional networks to refine each landmark separately, taking local regions as input. Several factors are critical to training these networks well. For example, Sun et al. [1] conduct extensive experiments on different structures for the networks that serve as the basic regression units. The input range of the local regions and the strategy for selecting local training patches also have a great impact on accuracy and reliability. However, in existing methods these factors are set intuitively or empirically. Moreover, the relationship between two successive networks has received little attention.
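The cascade scheme described above (start from the mean shape, then add a predicted increment at every stage) can be sketched as follows. This is only a minimal illustration, not the regressors of [7] or [8]: the names `cascade_predict` and `make_toy_stage` are hypothetical, and the toy stage replaces a learned regressor on shape-indexed features with a rule that simply moves toward a known target.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Shape = List[Point]
# A stage maps (image, current shape) to one (dx, dy) increment per landmark.
Stage = Callable[[object, Shape], List[Point]]


def cascade_predict(image: object, mean_shape: Sequence[Point],
                    stages: Sequence[Stage]) -> Shape:
    """Refine a shape estimate stage by stage, starting from the mean shape."""
    shape: Shape = [(float(x), float(y)) for x, y in mean_shape]
    for regress in stages:
        # Each stage sees the current estimate; in a real system this is
        # where features indexed by the current shape would be extracted.
        increment = regress(image, shape)
        shape = [(x + dx, y + dy)
                 for (x, y), (dx, dy) in zip(shape, increment)]
    return shape


def make_toy_stage(target: Sequence[Point], step: float) -> Stage:
    """Hypothetical stand-in for a learned regressor: moves each landmark
    a fixed fraction of the way toward a known target shape."""
    def regress(image: object, shape: Shape) -> List[Point]:
        return [(step * (tx - x), step * (ty - y))
                for (x, y), (tx, ty) in zip(shape, target)]
    return regress


# Toy usage: three stages, each halving the remaining error.
mean_shape = [(0, 0), (10, 0)]
target_shape = [(8, 4), (18, 4)]
toy_stages = [make_toy_stage(target_shape, 0.5) for _ in range(3)]
refined = cascade_predict(None, mean_shape, toy_stages)
```

In this toy setting each stage halves the remaining error, so after three stages one eighth of the initial offset is left. A real cascade has no access to the target; each stage is trained to predict the increment from image features.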
In this paper, we propose Adaptive Cascade Deep Convolutional Neural Networks (ACDCNN) for facial point detection. After the shape is initialized by a CNN model as in DCNN, each landmark is refined by a series of networks. Each of these networks takes the output of the previous network as input and predicts a new position for the landmark. Unlike existing methods [1], [9], which apply the same regression configuration to every landmark or facial component in a stage, we set the configuration according to the results obtained for each landmark. In addition, a Gaussian distribution is used to model the output error of the previous network. The input range of the local region is related to the expectation and the standard deviation of this Gaussian distribution. Once the input range is determined, patches centered at positions shifted from the ground truth position are taken for training. Instead of being taken at random, these patches are sampled according to that Gaussian distribution, so the most relevant image patches are selected for training the successive network, and these better training samples lead to better performance. Comparison experiments show that the proposed ACDCNN outperforms or is comparable to state-of-the-art methods in both robustness and accuracy.
The rest of the paper is organized as follows. Section 2 introduces related work, and Section 3 presents the proposed ACDCNN. Implementation details are described in Section 4. Section 5 reports our experimental results, and Section 6 concludes the paper.
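The adaptive sampling idea in the preceding paragraph can be illustrated with a short sketch: fit a Gaussian to the previous network's errors for one landmark, derive an input range from its mean and standard deviation, and draw training patch centers around the ground truth from that Gaussian. Everything named here is hypothetical. In particular, the text only says the input range is "related to" the expectation and standard deviation, so the rule `|mu| + k * sigma` with `k = 3` is an assumption made for illustration, as is treating the x and y axes independently.

```python
import random
import statistics
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Gaussian = Tuple[float, float]  # (mean, standard deviation) for one axis


def fit_error_gaussian(predictions: Sequence[Point],
                       ground_truth: Sequence[Point]
                       ) -> Tuple[Gaussian, Gaussian]:
    """Estimate the per-axis mean and standard deviation of the previous
    network's error (prediction minus ground truth) for one landmark."""
    dx = [float(p[0] - g[0]) for p, g in zip(predictions, ground_truth)]
    dy = [float(p[1] - g[1]) for p, g in zip(predictions, ground_truth)]
    return ((statistics.mean(dx), statistics.pstdev(dx)),
            (statistics.mean(dy), statistics.pstdev(dy)))


def input_half_width(gaussian: Gaussian, k: float = 3.0) -> float:
    """ASSUMPTION: half-width of the local input region along one axis,
    taken as |mu| + k * sigma. The source does not give the exact rule."""
    mu, sigma = gaussian
    return abs(mu) + k * sigma


def sample_patch_centers(gt: Point, gx: Gaussian, gy: Gaussian,
                         n: int, rng: random.Random) -> List[Point]:
    """Draw n training patch centers: the ground truth position shifted by
    offsets sampled from the fitted error Gaussians."""
    return [(gt[0] + rng.gauss(gx[0], gx[1]),
             gt[1] + rng.gauss(gy[0], gy[1]))
            for _ in range(n)]


# Toy usage with made-up predictions for a single landmark.
toy_predictions = [(1, 0), (3, 0), (1, 2), (3, 2)]
toy_ground_truth = [(0, 0)] * 4
gauss_x, gauss_y = fit_error_gaussian(toy_predictions, toy_ground_truth)
half_width_x = input_half_width(gauss_x)
centers = sample_patch_centers((50.0, 60.0), gauss_x, gauss_y,
                               n=5, rng=random.Random(0))
```

Sampling shifts from the fitted error distribution means the next network is trained mostly on patches that resemble the inputs it will actually receive from its predecessor.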
Related work
Many approaches to face alignment have been reported in the past decades. Among them, regression based methods are highly efficient and accurate and have therefore received increasing attention. Valstar et al. [14] use support vector regression to model the nonlinear mapping from input local features (Haar-like features) to target point locations. Dantone et al. [15] extend regression forests [16] to conditional regression forests. Head poses are utilized in the framework as
Adaptive Cascade Deep Convolutional Neural Networks
In this section, we present a novel method termed ACDCNN for face alignment. We explain each component of ACDCNN, including the initialization with a Deep Convolutional Neural Network and the Local Adaptive Cascade Networks (LACN).
Structures
Networks at different levels follow a similar architecture but differ in input size. Table 1 summarizes the input sizes for the different facial components. All networks are trained on the raw RGB values of the pixels. The first-level networks and the networks for the left and right eyes use higher-resolution inputs, since we observe that whole faces and eye regions contain richer context information than the nose and mouth regions.
Training
All networks are trained by stochastic gradient descent
Experiments
In this section, we first describe the datasets used for evaluation. We then examine every cascade network for each facial landmark in our method. Next, we compare with DCNN on the same training and validation sets. Finally, we compare the proposed ACDCNN with state-of-the-art methods and commercial software.
Conclusions and future work
In this paper, an adaptive cascade framework for face alignment is proposed. A Gaussian distribution is used to link the input of the current network to the output of the previous network. Each landmark is refined independently based on the statistics of its previous predictions, which also determine the adaptive sampling strategy used to select training patches. The benefit of adaptive sampling is that the most relevant image patches are exploited for training deep convolutional neural networks. We show that the
References (28)
- et al.
Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment
- et al.
Deep convolutional network cascade for facial point detection
- et al.
Localizing parts of faces using a consensus of exemplars
- et al.
Robust and accurate shape model fitting using random forest regression voting
- et al.
Face detection, pose estimation, and landmark localization in the wild
- et al.
Active appearance models
- et al.
A review of active appearance models
Syst. Man Cybern. C Appl. Rev. IEEE Trans. (2010)
- et al.
Supervised descent method and its applications to face alignment
- et al.
Face alignment at 3000 fps via regressing local binary features
- et al.
Facial landmark detection by deep multi-task learning
Facial Feature Point Detection: a Comprehensive Survey. arXiv, preprint arXiv:1410.1037
Cascaded pose regression
Reducing the dimensionality of data with neural networks
Science
Facial point detection using boosted regression and graph models
Dong Yuan is an associate professor at Beijing University of Posts and Telecommunications, China. He has also been invited as “France Telecom — Orange Expert on Solution of Content Service” of the France Telecom R&D Group. He received his PhD degree from Shanghai Jiao Tong University in 1999, worked as an R&D scientist at Nokia Research Center China from 1999 to 2001, and worked as a post-doctoral researcher in the Engineering Department of Cambridge University, UK, from 2001 to 2003. His current research interests include semantic video indexing, video copy detection, and multimedia content search.
Wu Yue is a postgraduate student at Beijing University of Posts and Telecommunications, China. He received the B.S. degree in Electronic Information Engineering from Beijing University of Posts and Telecommunications in 2013. His current research interests are face tracking, face alignment, face recognition, object detection and deep learning.