
Pattern Recognition

Volume 105, September 2020, 107311

Heterogenous output regression network for direct face alignment

https://doi.org/10.1016/j.patcog.2020.107311

Abstract

Face alignment has gained great popularity in computer vision due to its widespread applications. In this paper, we propose a novel learning architecture, i.e., heterogenous output regression network (HORNet), for face alignment, which directly predicts facial landmarks from images. HORNet is based on kernel approximations and establishes a new compact multi-layer architecture. A nonlinear layer with cosine activations disentangles nonlinear relationships between representations of images and shapes of facial landmarks. A linear layer with identity activations explicitly encodes landmark correlations by low-rank learning via matrix elastic nets. HORNet is highly flexible and can work either with pre-built feature representations or with convolutional architectures for end-to-end learning. HORNet leverages the strengths of both kernel methods in modeling nonlinearities and neural networks in structural prediction. This combination renders it effective and efficient for direct face alignment. Extensive experiments on five in-the-wild datasets show that HORNet delivers high performance and consistently exceeds state-of-the-art methods.

Introduction

Face alignment, as a fundamental face analysis task, has gained great popularity in computer vision, alongside related tasks such as face recognition [1], [2]. Its widespread applications include face verification, facial expression analysis, human-machine interaction, and animation. Face alignment finds the locations, i.e., the coordinates, of a set of predefined landmark points on the face image. The challenges stem from the large variations in head pose [3], facial expression [4], illumination [5], [6], and partial occlusion [7], especially when it comes to unconstrained datasets in the wild [8].

Therefore, the relationship between the face image and the shape of landmarks is highly nonlinear: face images are usually represented by low-level feature descriptors, while the predefined facial landmarks carry high-level semantic meaning. Moreover, the landmarks are spatially interdependent and strongly correlated, which should be modeled to improve the prediction performance. Since cascaded regression [9] was introduced to face alignment [10], it has provided state-of-the-art performance in various face alignment tasks [11], [12]. Instead of performing one-step regression, cascaded regression starts with an initial shape and iteratively refines it with a cascade of regressors. Building upon cascaded regression, many improved variants have been developed, which distinguish themselves by their shape initialization strategies [13], shape-indexed features [14], or regressors [11].

Despite the great progress made, cascaded regression [15], [16] suffers from several widely acknowledged, innate shortcomings. It is highly dependent on and sensitive to initialization. The estimated solution is prone to getting trapped in local optima when starting from a poor shape initialization. Unfortunately, this is likely to happen in practice, due to the huge head pose variations [17]. Attempts to circumvent this problem, e.g., by applying multiple runs [9], [13], fall short of offering a principled solution. Moreover, it has been observed that cascaded models, e.g., supervised descent methods (SDM) [11], are only effective within a specific domain of homogeneous descent [18], [19]. This indicates that only limited performance can be expected for unconstrained faces with large head rotations and large shape deformations. In addition, previous work [9], [11] mostly treats the coordinates of landmarks independently, thus largely overlooking the correlation of facial landmarks [11]. That said, a shape constraint was exploited in [13], [20]. Implicit measures to include landmark relations, like minimizing shape parameter errors, tend to be suboptimal and do not necessarily lead to lower alignment errors [19]. It is highly desirable to incorporate the correlation information in a more explicit and principled way, which would improve the prediction performance, e.g., by recovering occluded landmarks [21]. Direct face alignment [22], [23] without relying on cascaded regression has recently gained increasing popularity and achieved high performance on existing benchmarks, showing great promise for efficient and accurate face alignment [24].

In this paper, we propose directly predicting the locations of facial landmarks from the image rather than relying on any cascaded regression models. Our method finds the explicit mapping from images to shapes composed of facial landmarks, and offers a principled way to avoid the shortcomings of cascaded regression. We formulate face alignment as a multivariate regression problem, where each landmark point coordinate corresponds to a regression output. In contrast to cascaded regression, our method takes the holistic image representation as the input rather than shape-indexed features; it does not require any shape initialization and allows for the direct prediction of all facial landmarks.

It is a nontrivial task, however, to directly predict facial landmarks from images. The highly nonlinear input-output relationship poses serious challenges, induced by the large image appearance variations and the highly deformable shapes of facial landmarks. We propose the heterogenous output regression network (HORNet) to simultaneously handle these challenges in one single framework.

Derived from kernel approximation, HORNet establishes a new compact multi-layer learning network that is composed of a nonlinear learning layer and a linear low-rank learning layer. The nonlinear layer, derived from kernel approximation, is implemented as a feed-forward neural network with a cosine activation function. It achieves nonlinear mappings from the inputs to the hidden layers, disentangling the complicated nonlinear relationships between image representations and shapes of facial landmarks. The linear layer with an identity activation function explicitly and efficiently encodes the intrinsic correlations between landmark points by low-rank learning via the matrix elastic net (MEN). The parameters in both the nonlinear and linear layers can be jointly optimized via a newly derived alternating optimization algorithm based on mini-batch gradient descent.
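
To make the structure concrete, the following is a minimal sketch of such a two-layer regressor: a hidden layer with a cosine activation followed by a linear output layer whose weight matrix is penalized by a matrix elastic net (nuclear norm plus squared Frobenius norm). The use of PyTorch, the layer sizes, the regularization weights, and the single joint gradient step are illustrative assumptions; the paper's actual alternating optimization and hyperparameters are not reproduced here.

```python
import torch
import torch.nn as nn

class HORNetSketch(nn.Module):
    """Minimal sketch of the two-layer structure described above (assumed sizes)."""
    def __init__(self, in_dim, hidden_dim, num_landmarks):
        super().__init__()
        # Nonlinear layer: affine map followed by a cosine activation,
        # in the spirit of a random-Fourier-feature kernel approximation.
        self.nonlinear = nn.Linear(in_dim, hidden_dim)
        # Linear layer with identity activation: maps hidden features to the
        # 2*Q landmark coordinates; its weight matrix receives the low-rank penalty.
        self.linear = nn.Linear(hidden_dim, 2 * num_landmarks)

    def forward(self, x):
        h = torch.cos(self.nonlinear(x))   # nonlinear mapping to the hidden layer
        return self.linear(h)              # direct landmark prediction

def matrix_elastic_net(W, lam_nuclear=1e-3, lam_frob=1e-4):
    """Matrix elastic net penalty: nuclear norm plus squared Frobenius norm."""
    return (lam_nuclear * torch.linalg.matrix_norm(W, ord='nuc')
            + lam_frob * torch.linalg.matrix_norm(W, ord='fro') ** 2)

# One mini-batch gradient step on synthetic data (joint update, for brevity).
model = HORNetSketch(in_dim=512, hidden_dim=1024, num_landmarks=68)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(32, 512)        # pre-built image features (placeholder)
y = torch.randn(32, 2 * 68)     # ground-truth landmark coordinates (placeholder)
opt.zero_grad()
loss = nn.functional.mse_loss(model(x), y) + matrix_elastic_net(model.linear.weight)
loss.backward()
opt.step()
```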

Moreover, HORNet is highly flexible and can work either with pre-built feature representations or with convolutional architectures for end-to-end deep learning. HORNet inherits the strengths of kernel methods in tackling the nonlinearity in data and enjoys the innate properties of neural networks in structural prediction. These properties fill the gap between kernel methods and neural networks for multivariate regression.

The contributions of this work can be summarized as follows:

  • We propose a novel learning architecture, heterogenous output regression network (HORNet), for face alignment. HORNet allows for the direct prediction of facial landmarks from images, without relying on iterative cascaded regression models or shape initialization.

  • HORNet can jointly handle the highly nonlinear input-output relationships and the underlying intrinsic correlations between outputs in one single framework.

  • HORNet is highly flexible and can work either with pre-built feature representations or with convolutional architectures for end-to-end learning. It scales well to large datasets and is able to leverage large annotated unconstrained face alignment datasets.

HORNet has been extensively evaluated on five challenging in-the-wild datasets, including CelebA [25], MAFL [26], AFLW [27], 300-W [28] and 300-VW [29]. Experimental results show that HORNet achieves state-of-the-art performance and surpasses previous methods, demonstrating its great effectiveness for direct face alignment.

This paper focuses on the problem of explainable deep learning for efficient and robust pattern recognition, jointly aiming to improve explainability, efficiency and robustness. Our work sheds light on the interpretation of the learning behaviour of deep neural networks. This problem has been the subject of recent state-of-the-art work on explaining deep learning models [30], [31], [32], [33].

On the one hand, our model offers a clear explanation for the strong learning ability of the proposed deep learning architecture for heterogenous output regression. The proposed HORNet contains a nonlinear layer that enables the model to disentangle the complex relationship between input images and the associated facial shapes represented as a set of landmarks. Images, in the form of pixels, generally contain low-level information such as edges and contours, while facial shapes lie at a higher semantic level [34]. The proposed nonlinear layer is derived from nonlinear kernels by the technique of kernel approximation based on random Fourier features (RFFs). The RFFs lift the feature representations extracted by the convolutional architecture to a higher semantic level in the reproducing kernel Hilbert space (RKHS) [31]. This is also related to the recent work in [35], which explores nonlinear convolutional filters as vectors in an RKHS. Moreover, the cosine activation function in the derived nonlinear layer admits a natural interpretation and provides a useful alternative to the ReLU activation, which plays an important role in the learning behaviour of deep neural networks [36]. The proposed HORNet combines the strengths of both deep learning and kernels, in a similar spirit to the recent work on learning deep kernels [37].
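
As an illustration of this kernel-approximation view, the snippet below shows how random Fourier features with a cosine activation approximate a Gaussian (RBF) kernel; the kernel choice, bandwidth, and feature count are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Exact Gaussian (RBF) kernel value exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def random_fourier_features(X, n_features=2000, gamma=0.5, seed=0):
    """Random Fourier features: sqrt(2/D) * cos(XW + b), whose inner products
    approximate the RBF kernel (Rahimi & Recht)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(2, 16))   # two toy feature vectors
phi = random_fourier_features(X)
print("exact kernel :", rbf_kernel(X[0], X[1]))
print("RFF estimate :", phi[0] @ phi[1])            # close to the exact value
```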

On the other hand, the linear layer with a nuclear norm constraint establishes a matrix elastic net and is related to the recent work [30], in which deep learning models are explained via a Bayesian nonparametric regression mixture model with multiple elastic nets. With our matrix elastic net, HORNet is able to better capture the interdependency among landmarks, which can also be well explained from the perspective of subspace learning. By imposing a low-rank constraint, the linear layer forces correlated outputs, e.g., landmarks that are spatially connected and dependent, to share roughly the same subspace of feature representations. Moreover, the low-rank constraint also forces the network to extract the most important features [33], [38] that are closely related to facial landmarks, which enables more robust predictions [32]. Additionally, the linear layer with a low-rank constraint also enhances the expressive ability by increasing the depth [39], since the linear layer can be viewed as a network with one hidden layer and an identity activation, while the increased depth does not introduce any bad local minima [40].
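
For intuition on how a nuclear-norm penalty induces low rank, the sketch below applies its proximal operator, singular-value soft-thresholding, to a random weight matrix; the matrix size and threshold are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

def singular_value_soft_threshold(W, tau):
    """Proximal operator of the nuclear norm: soft-threshold the singular
    values of W by tau, driving W toward a low-rank matrix."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 136))          # e.g. hidden_dim x (2 * 68 landmark coordinates)
W_low = singular_value_soft_threshold(W, tau=5.0)
print("rank before:", np.linalg.matrix_rank(W))      # full rank (64)
print("rank after :", np.linalg.matrix_rank(W_low))  # reduced rank
```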

In addition, our HORNet is computationally more efficient than previous models based on cascaded regression for face alignment, which enables faster prediction. Thanks to its architecture, HORNet directly predicts the facial landmarks without relying on progressive cascaded regression. In particular, at test time, given an image, we can directly obtain the landmarks by a few steps of feedforward matrix computation rather than by the iterative optimization of cascaded models. In general, in contrast to regular deep learning architectures, our proposed HORNet offers useful interpretability for deep learning models, which is also well supported by our experiments.

Section snippets

Related work

We review representative work to briefly show the historical progress of face alignment over the last decade. It is also worth mentioning that the rapid development in face alignment is largely attributed to the release of large annotated datasets of facial landmarks [41].

In earlier work, face alignment was addressed mainly with active shape/appearance models (ASM/AAM) [42]. Despite many improved variants of ASM/AAM [43], these models suffer from poor generalization to unseen images [44]

Direct face alignment

We introduce a novel heterogenous output regression network (HORNet) for face alignment. HORNet learns a mapping function from the input space $\mathcal{X} \subseteq \mathbb{R}^d$ of image representations to the multivariate output space $\mathcal{Y} \subseteq \mathbb{R}^Q$ of facial landmarks. Taking the feature representation of the facial image as the input, HORNet directly predicts a shape composed of facial landmarks as the multivariate output.
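
As a rough illustration of what "direct" prediction means here, the sketch below maps a d-dimensional feature vector to a Q-dimensional landmark vector with one cosine layer and one linear layer; the random weights and dimensions are placeholders, not trained HORNet parameters.

```python
import numpy as np

# Illustrative dimensions: d-dimensional image features in, Q = 2 * 68 coordinates out.
d, hidden, Q = 512, 1024, 2 * 68

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, hidden)), rng.normal(size=hidden)   # nonlinear layer (placeholder weights)
W2, b2 = rng.normal(size=(hidden, Q)), rng.normal(size=Q)        # linear layer (placeholder weights)

def predict_landmarks(x):
    """Direct prediction: two matrix multiplications and a cosine,
    with no shape initialization and no iterative refinement."""
    h = np.cos(x @ W1 + b1)
    return h @ W2 + b2

x = rng.normal(size=d)           # feature representation of one face image
shape = predict_landmarks(x)     # Q-dimensional vector of landmark coordinates
print(shape.shape)               # (136,)
```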

Experiments and results

We conduct extensive experiments to evaluate HORNet for direct face alignment on five challenging unconstrained datasets in the wild. They come with the diversity and challenges confronting existing face alignment tasks, from single images to video sequences, and thus provide a comprehensive evaluation of face alignment approaches. HORNet consistently achieves the state-of-the-art performance, showing its great effectiveness and generality for face alignment.

Conclusion and future work

In this paper, we have presented heterogenous output regression network (HORNet) for face alignment, which does not rely on cascaded regression. HORNet represents a new compact multivariate learning architecture comprised of a nonlinear layer and a low-rank linear layer. In particular, (i) HORNet combines the strengths of neural networks for structure prediction and kernels for nonlinear learning, which offers a new powerful multi-output regressor; (ii) HORNet is independent of input features


References (86)

  • M. Castelan et al.

    A coupled statistical model for face shape recovery from brightness images

    IEEE Trans. Image Process.

    (2007)
  • X.P. Burgos-Artizzu et al.

    Robust face landmark estimation under occlusion

    ICCV

    (2013)
  • X. Zhu et al.

    Face detection, pose estimation, and landmark localization in the wild

    CVPR

    (2012)
  • P. Dollár et al.

    Cascaded pose regression

    CVPR

    (2010)
  • X. Cao et al.

    Face alignment by explicit shape regression

    Int. J. Comput. Vis.

    (2014)
  • X. Xiong et al.

    Supervised descent method and its applications to face alignment

    CVPR

    (2013)
  • G. Tzimiropoulos

    Project-out cascaded regression with an application to face alignment

    CVPR

    (2015)
  • X. Cao et al.

    Face alignment by explicit shape regression

    CVPR

    (2012)
  • S. Zhu et al.

    Face alignment by coarse-to-fine shape searching

    CVPR

    (2015)
  • X. Xiong et al.

    Global supervised descent method

    CVPR

    (2015)
  • S. Zhu et al.

    Unconstrained face alignment via cascaded compositional learning

    CVPR

    (2016)
  • O. Tuzel et al.

    Robust face alignment using a mixture of invariant experts

    ECCV

    (2016)
  • H.K. Galoogahi et al.

    Correlation filter cascade for facial landmark localization

    WACV

    (2016)
  • X. Miao et al.

    Direct shape regression networks for end-to-end face alignment

    CVPR

    (2018)
  • L. Yue et al.

    Attentional alignment networks

    BMVC

    (2018)
  • P. Li et al.

    Multi-scale aggregation network for direct face alignment

    WACV

    (2019)
  • Z. Liu et al.

    Deep learning face attributes in the wild

    ICCV

    (2015)
  • Z. Zhang et al.

    Learning deep representation for face alignment with auxiliary attributes

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2016)
  • M. Köstinger et al.

    Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization

    ICCVW

    (2011)
  • C. Sagonas et al.

    300 faces in-the-wild challenge: the first facial landmark localization challenge

    ICCVW

    (2013)
  • J. Shen et al.

    The first facial landmark tracking in-the-wild challenge: benchmark and results

    ICCVW

    (2015)
  • W. Guo et al.

    Explaining deep learning models-a Bayesian non-parametric approach

    NIPS

    (2018)
  • M.D. Zeiler et al.

    Visualizing and understanding convolutional networks

    ECCV

    (2014)
  • S.M. Lundberg et al.

    A unified approach to interpreting model predictions

    NIPS

    (2017)
  • P. Schwab et al.

    CXPlain: causal explanations for model interpretation under uncertainty

    NIPS

    (2019)
  • K. Hara et al.

    Growing regression tree forests by classification for continuous object pose estimation

    Int. J. Comput. Vis.

    (2017)
  • Y. Zhang et al.

    Convexified convolutional neural networks

    ICML

    (2017)
  • S. Hayou et al.

    On the impact of the activation function on deep neural networks training

    ICML

    (2019)
  • A.G. Wilson et al.

    Deep kernel learning

    AISTATS

    (2016)
  • A. Shrikumar et al.

    Learning important features through propagating activation differences

    ICML

    (2017)
  • M. Telgarsky

    Benefits of depth in neural networks

    COLT

    (2016)
  • K. Kawaguchi

    Deep learning without poor local minima

    NIPS

    (2016)
  • X. Liu

    Discriminative face alignment

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2009)

    Xiantong Zhen received the B.S. and M.E. degrees from Lanzhou University, Lanzhou, China in 2007 and 2010, respectively, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, the University of Sheffield, UK in 2013. He worked as a postdoctoral fellow with the University of Western Ontario, London, Canada and the University of Texas at Arlington, Texas, U.S.A. from 2013 to 2017. He was an associate professor with the School of Electronic and Information Engineering, Beihang University, Beijing, China from 2017 to 2018. He is currently with the Inception Institute of Artificial Intelligence, United Arab Emirates and Guangdong University of Petrochemical Technology, Guangdong, China. His research interests include machine learning and computer vision.

    Mengyang Yu received the B.S. and M.S. degrees from the School of Mathematical Sciences, Peking University, Beijing, China, in 2010 and 2013, respectively, and the Ph.D. degree from the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, U.K., in 2017. He was a postdoctoral researcher at the Computer Vision Laboratory, ETH Zurich, Switzerland from 2017 to 2018. He is now a research scientist with Inception Institute of Artificial Intelligence. His research interests include computer vision, machine learning, and information retrieval.

    Zehao Xiao received his B.S. and M.E. degrees from the School of Electronic and Information Engineering, Beihang University, in 2017. He is currently a Ph.D. candidate at the University of Amsterdam, The Netherlands. His research interests include computer vision, medical image analysis and machine learning.

    Lei Zhang received her PhD degree in computer science from Harbin Institute of Technology (HIT), Harbin, Heilongjiang, China, in 2004. She is currently a professor of Computer Science College, Guangdong University of Petrochemical Technology, Guangdong, China. Her research interests include Signal/Image Processing, Computer Vision and Machine Learning.

    Ling Shao is CEO and Chief Scientist at the Inception Institute of Artificial Intelligence, United Arab Emirates, and a professor with the School of Computing Sciences at the University of East Anglia, Norwich, UK. Previously, he was a professor (2014–2016) with Northumbria University, a senior lecturer (2009–2014) with the University of Sheffield and a senior scientist (2005–2009) with Philips Research, The Netherlands. His research interests include computer vision, image/video processing and machine learning. He is an associate editor of IEEE Transactions on Image Processing, IEEE Transactions on Neural Networks and Learning Systems and several other journals. He is a Fellow of the British Computer Society and the Institution of Engineering and Technology.

    This work was supported in part by the National Natural Science Foundation of China (grant nos. 61871016 and 61976060).
