
Computer Standards & Interfaces

Volume 42, November 2015, Pages 105-112

Adaptive Cascade Deep Convolutional Neural Networks for face alignment

https://doi.org/10.1016/j.csi.2015.06.004

Highlights

  • In this paper, an adaptive Cascade Deep Convolutional Neural Networks (ACDCNN) framework is proposed for face alignment prior to face recognition.

  • A new convolutional network structure with three convolutional layers and three fully-connected layers is introduced. A Gaussian distribution is utilized to model the output error of the previous networks and to adjust network configurations adaptively.

  • Experiments show that our method achieves better accuracy than state-of-the-art methods, with low complexity and good robustness.

Abstract

Deep convolutional network cascades have been successfully applied to face alignment. The configuration of each network, including the strategy for selecting local patches for training and the input range of those patches, is crucial for achieving the desired performance. In this paper, we propose an adaptive cascade framework, termed Adaptive Cascade Deep Convolutional Neural Networks (ACDCNN), which adjusts the cascade structure adaptively. A Gaussian distribution is utilized to bridge the successive networks. Extensive experiments demonstrate that our proposed ACDCNN achieves state-of-the-art accuracy with reduced model complexity and increased robustness.

Introduction

Face alignment, or facial landmark localization, plays a critical role in many visual applications such as face recognition, face tracking, facial expression recognition and 3D face modeling. Therefore, it has been extensively studied in recent years. However, robust facial landmark detection remains a challenging problem when face images are taken under extreme occlusion, lighting, expression and pose variations. To address this issue, researchers have explored modeling shape and appearance variation for improved performance. In general, this research can be categorized into three groups: constrained local model based methods [2], [3], [4], active appearance model based methods [5], [6] and regression based methods [1], [7], [8], [9], [10].

Constrained local models build classifiers called component detectors to search for each facial feature point independently. These component detectors compute response maps to represent the appearance variation around facial feature points. Due to ambiguity and corruption in local features, facial points detected by the local experts may lie far from the ground-truth positions. Shape constraints are therefore applied to adjust the initial positions for improved results [2], [4]. However, global contextual information is difficult to embed into these methods.
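The detector-plus-shape-constraint pipeline described above can be sketched in a few lines (a toy illustration, not the cited implementations; the simple blending step stands in for a fitted shape model, and all names are ours):

```python
import numpy as np

# Toy sketch of a constrained local model: each component detector yields a
# response map per landmark; the peak is the initial estimate, and a shape
# constraint pulls implausible points toward a prior (mean) shape.
def clm_fit(response_maps, prior_shape, alpha=0.5):
    # response_maps: (n_points, H, W); prior_shape: (n_points, 2) as (x, y)
    peaks = np.array([
        np.unravel_index(np.argmax(r), r.shape)[::-1]  # (x, y) of the peak
        for r in response_maps
    ], dtype=float)
    # crude stand-in for the shape constraint: blend peaks with the prior
    return (1.0 - alpha) * peaks + alpha * prior_shape
```

With alpha = 0 the fit trusts the local detectors entirely; increasing alpha expresses exactly the trade-off discussed above, at the cost of ignoring image evidence.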

Instead of modeling the appearance around each facial point, active appearance models such as the Active Appearance Model (AAM) [5] take a holistic perspective to model appearance variation. An AAM is composed of a linear shape model and a linear texture model, with Principal Component Analysis (PCA) used to relate the two. Nevertheless, simple linear models can hardly capture the nonlinear variations of facial appearance when faces are taken in complex environments (e.g., extreme lighting).
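The linear shape model at the core of an AAM can be illustrated with PCA alone (a minimal sketch under our own naming; a full AAM additionally carries a texture model and warping, omitted here):

```python
import numpy as np

# Minimal PCA shape model: shapes are flattened landmark vectors; PCA gives
# a mean plus a linear basis, and any shape is approximated as
# mean + basis^T @ coefficients.
def fit_shape_model(shapes, n_components):
    # shapes: (n_samples, 2 * n_points), each row a flattened landmark set
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # principal directions from the SVD of the centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]              # (n_components, 2 * n_points)
    return mean, basis

def project(shape, mean, basis):
    coeffs = basis @ (shape - mean)        # shape parameters
    return mean + basis.T @ coeffs         # reconstruction in the subspace
```

The reconstruction is confined to the linear subspace spanned by the training shapes, which is precisely why such models struggle with the nonlinear variations noted above.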

Regression based methods, on the other hand, directly learn a regression function from image appearance (features) to the target output (shapes) [11]. A cascade architecture is usually employed in regression based models. In each stage of the cascade, shape-indexed features [12] are extracted to predict the shape increment, using linear regression [7] or tree-based regression [8], with the mean shape serving as the initialization. Coarse-to-Fine Auto-Encoder Networks (CFAN) [9] utilize a Stacked Auto-encoder Network [13] to predict the face shape quickly by taking the whole face as input. DCNN [1] employs a deep CNN model to extract high-level features and make an accurate initial prediction. After initialization, DCNN uses two levels of convolutional networks to refine each landmark separately, taking local regions as input. Several factors are critical for training these networks well. For example, Sun et al. [1] conduct extensive experiments to investigate different network structures, which serve as the basic regression units. The input range of local regions and the strategy for selecting local patches for training are other major factors with great impact on accuracy and reliability, yet in traditional methods these factors are set by intuition or empirically. Besides, the relationship between successive networks remains underexplored.
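The generic cascaded-regression update described above can be sketched as follows (an illustrative skeleton, not the implementations of [7], [8]; the feature extractor and the regressors are stand-ins supplied by the caller):

```python
import numpy as np

# Cascaded regression: starting from the mean shape, each stage extracts
# shape-indexed features at the current estimate and applies a learned
# regressor to predict an additive shape increment.
def run_cascade(image, mean_shape, regressors, features):
    shape = mean_shape.copy()
    for R in regressors:
        phi = features(image, shape)   # shape-indexed features
        shape = shape + R @ phi        # additive shape increment
    return shape
```

Because the features are re-indexed by the current shape at every stage, each regressor only needs to correct the residual error left by its predecessors.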

In this paper, we propose Adaptive Cascade Deep Convolutional Neural Networks (ACDCNN) for facial point detection. After the shape is initialized by a CNN model as in DCNN, each landmark is refined by a series of networks. Each network takes the output of the previous network as input and predicts a new position for the landmark. Different from existing methods [1], [9], which apply the same regression configuration to each landmark or facial component within a stage, we set the configuration according to the result for each landmark. In addition, a Gaussian distribution is used to model the output error of the previous network; the input range of the local region is derived from the mean and standard deviation of this distribution. Once the input range is determined, patches centered at positions shifted from the ground-truth position are taken for training. Instead of sampling these patches uniformly at random, we draw them from that Gaussian distribution, so that the most relevant image patches are selected for training the successive network. These better training samples lead to better performance. Comparison experiments show that the proposed ACDCNN outperforms or is comparable to state-of-the-art methods in both robustness and accuracy.
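The adaptive sampling idea can be sketched as follows (a minimal illustration under our own naming; the 3-sigma rule used for the crop range is an assumption for the sketch, not necessarily the paper's exact choice):

```python
import numpy as np

# The previous network's localization errors (predicted minus ground truth,
# per landmark) are modeled with a Gaussian; the crop range for the next
# network follows from its statistics, and training-patch offsets are drawn
# from the same distribution rather than uniformly at random.
def fit_error_model(errors):
    # errors: (n_samples, 2) offsets from ground truth for one landmark
    mu = errors.mean(axis=0)
    sigma = errors.std(axis=0)
    return mu, sigma

def input_range(mu, sigma, k=3.0):
    # crop half-width large enough to cover k standard deviations of error
    return np.abs(mu) + k * sigma

def sample_offsets(mu, sigma, n, rng):
    # patch-center offsets for training, drawn from the fitted Gaussian
    return rng.normal(mu, sigma, size=(n, 2))
```

A landmark that the previous stage already localizes well thus receives a tight crop and tightly clustered training patches, while a poorly localized landmark gets a wider search region, which is the adaptivity claimed above.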

The rest of the paper is organized as follows. Section 2 introduces related work, followed by our proposed ACDCNN in Section 3. Implementation details are described in Section 4. Section 5 reports our experimental results, followed by conclusions in Section 6.


Related work

Many approaches to face alignment have been reported in the past decades, among which regression based methods show highly efficient and accurate performance and have thus received increasing attention. Valstar et al. [14] develop support vector regression to model the nonlinear transform from input local features (Haar-like features) to target point locations. Dantone et al. [15] extend regression forests [16] to conditional regression forests. Head poses are utilized in the framework as

Adaptive Cascade Deep Convolutional Neural Networks

In this section, we present a novel method, termed ACDCNN, for face alignment. The details of each component of ACDCNN, including the initialization with a Deep Convolutional Neural Network and the Local Adaptive Cascade Networks (LACN), are explained.

Structures

Networks at different levels follow a similar architecture with varied input sizes. Table 1 summarizes the input sizes for different facial components. All networks are trained on raw RGB pixel values. The networks used in the first level and for the left and right eyes have higher input resolution, since whole faces and eye regions are observed to contain richer contextual information than the nose and mouth regions.
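Because each input resolution must survive the same conv/pool stack, the admissible sizes are constrained by simple arithmetic. The sketch below shows only that size arithmetic; the kernel and pooling parameters are illustrative placeholders, not the values from Table 1:

```python
# Spatial output size of one convolution (or pooling) operation.
def conv_out(size, kernel, stride=1, pad=0):
    return (size + 2 * pad - kernel) // stride + 1

# Propagate an input size through a stack of conv/pool operations,
# given as (kernel, stride, pad) triples in order.
def stack_out(size, layers):
    for k, s, p in layers:
        size = conv_out(size, k, s, p)
    return size

# e.g. three 5x5 conv layers (stride 1), each followed by 2x2 max pooling
example_stack = [(5, 1, 0), (2, 2, 0)] * 3
```

Running `stack_out` over candidate resolutions is a quick way to check that a chosen input size leaves a valid (positive) feature-map size before the fully-connected layers.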

Training

All networks are trained by stochastic gradient descent
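The stochastic gradient descent mentioned above can be illustrated on a linear stand-in for the landmark regressor (squared localization error; the learning rate, batch size and epoch count are arbitrary choices for the sketch, and momentum and weight decay are omitted):

```python
import numpy as np

# Mini-batch SGD on a linear model W minimizing 0.5 * ||X W - Y||^2.
def sgd_train(X, Y, lr=0.1, epochs=200, batch=4, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(epochs):
        order = rng.permutation(len(X))          # reshuffle each epoch
        for start in range(0, len(X), batch):
            b = order[start:start + batch]
            # gradient of the squared error, averaged over the mini-batch
            grad = X[b].T @ (X[b] @ W - Y[b]) / len(b)
            W -= lr * grad
    return W
```

The same update rule, applied layer by layer via backpropagation, trains the convolutional networks; only the model being differentiated changes.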

Experiments

In this section, we first describe the datasets used for evaluation. Every cascade network for each facial landmark in our method is then investigated. Next, a comparison with DCNN is conducted on the same training and validation sets. Finally, we compare the proposed ACDCNN with state-of-the-art methods and commercial software.

Conclusions and future work

In this paper, an adaptive cascade framework for face alignment is proposed. A Gaussian distribution is used to bridge the current network's input with the previous network's output. Each landmark is refined independently based on its previous statistical information, on which the adaptive sampling strategy for selecting training patches depends. The benefit of adaptive sampling is that the most relevant image patches are exploited for training the deep convolutional neural networks. We show that the


References (28)

  • J. Zhang et al.

    Coarse-to-Fine Auto-Encoder Networks (CFAN) for real-time face alignment

  • Y. Sun et al.

    Deep convolutional network cascade for facial point detection

  • P.N. Belhumeur et al.

    Localizing parts of faces using a consensus of exemplars

  • T.F. Cootes et al.

    Robust and accurate shape model fitting using random forest regression voting

  • X. Zhu et al.

    Face detection, pose estimation, and landmark localization in the wild

  • T.F. Cootes et al.

    Active appearance models

  • X. Gao et al.

    A review of active appearance models

    IEEE Trans. Syst. Man Cybern. C Appl. Rev.

    (2010)
  • X. Xiong et al.

    Supervised descent method and its applications to face alignment

  • S. Ren et al.

    Face alignment at 3000 fps via regressing local binary features

  • Z. Zhang et al.

    Facial landmark detection by deep multi-task learning

  • N. Wang et al.

    Facial Feature Point Detection: A Comprehensive Survey, arXiv preprint arXiv:1410.1037

    (2014)
  • P. Dollár et al.

    Cascaded pose regression

  • G.E. Hinton et al.

    Reducing the dimensionality of data with neural networks

    Science

    (2006)
  • M. Valstar et al.

    Facial point detection using boosted regression and graph models


    Dong Yuan is an associate professor at Beijing University of Posts and Telecommunications, China. He is also invited as “France Telecom — Orange Expert on Solution of Content Service” of the France Telecom R&D Group. He received his PhD degree from Shanghai Jiao Tong University in 1999, worked as an R&D scientist at Nokia Research Center China from 1999–2001, and worked as post-doctoral research staff in the Engineering Department of Cambridge University, UK, from 2001–2003. His current research interests include semantic video indexing, video copy detection, and multimedia content search.

    Wu Yue is a postgraduate student at Beijing University of Posts and Telecommunications, China. He received the B.S. degree in Electronic Information Engineering from Beijing University of Posts and Telecommunications in 2013. His current research interests are face tracking, face alignment, face recognition, object detection and deep learning.

    The work is sponsored by the Chinese NSFC project 61372169.
