Heterogeneous output regression network for direct face alignment
Introduction
Face alignment, a fundamental face analysis task, has gained great popularity in computer vision, underpinning tasks such as face recognition [1], [2]. Its widespread applications include face verification, facial expression analysis, human-machine interaction, and animation. Face alignment finds the locations, i.e. the coordinates, of a set of predefined landmark points on a face image. The challenges stem from large variations in head pose [3], facial expression [4], illumination [5], [6], and partial occlusion [7], especially on unconstrained in-the-wild datasets [8].
The relationship between the face image and the shape of landmarks is therefore highly nonlinear: face images are usually represented by low-level feature descriptors, while the predefined facial landmarks carry high-level semantic meaning. Moreover, the landmarks are spatially interdependent and strongly correlated, which should be modeled to improve prediction performance. Since cascaded regression [9] was introduced to face alignment [10], it has provided state-of-the-art performance in various face alignment tasks [11], [12]. Instead of performing one-step regression, cascaded regression starts with an initial shape and iteratively refines it with a cascade of regressors. Building upon cascaded regression, many improved variants have been developed, which distinguish themselves by their shape initialization strategies [13], shape-indexed features [14], or regressors [11].
Despite the great progress made, cascaded regression [15], [16] suffers from several widely acknowledged, innate shortcomings. It is highly dependent on and sensitive to initialization: the estimated solution is prone to getting trapped in local optima when starting from a poor shape initialization. Unfortunately, this is likely to happen in practice due to huge head pose variations [17]. Attempts to circumvent this problem, e.g. by applying multiple runs [9], [13], fall short of offering a principled solution. Moreover, it has been observed that cascaded models, e.g. supervised descent methods (SDM) [11], are only effective within a specific domain of homogeneous descent [18], [19]. This indicates that only limited performance can be expected for unconstrained faces with large head rotations and large shape deformations. In addition, previous work [9], [11] mostly treats the coordinates of landmarks independently, thus largely overlooking the correlation of facial landmarks [11]. That said, a shape constraint was exploited in [13], [20]. Implicit measures to include landmark relations, such as minimizing shape parameter errors, tend to be suboptimal and do not necessarily lead to lower alignment errors [19]. It is highly desirable to incorporate the correlation information in a more explicit and principled way, which would improve prediction performance, e.g. in recovering occluded landmarks [21]. Direct face alignment [22], [23] without relying on cascaded regression has recently gained increasing popularity and achieved high performance on existing benchmarks, showing great promise for efficient and accurate face alignment [24].
In this paper, we propose directly predicting the locations of facial landmarks from the image rather than relying on any cascaded regression models. Our method finds the explicit mapping from images to shapes composed of facial landmarks, and offers a principled way to avoid the shortcomings of cascaded regression. We formulate face alignment as a multivariate regression problem, where each landmark point coordinate corresponds to a regression output. In contrast to cascaded regression, our method takes the holistic image representation as the input rather than shape-indexed features; it does not require any shape initialization and allows for the direct prediction of all facial landmarks.
However, directly predicting facial landmarks from images is a nontrivial task. The highly nonlinear input-output relationship, induced by large variations in image appearance and the highly deformable shapes of facial landmarks, poses serious challenges. We propose the heterogeneous output regression network (HORNet) to handle these challenges simultaneously in a single framework.
Derived from kernel approximation, HORNet forms a new compact multi-layer learning network composed of a nonlinear learning layer and a linear low-rank learning layer. The nonlinear layer, derived from kernel approximation, is implemented as a feed-forward neural network with a cosine activation function. It achieves nonlinear mappings from the inputs to the hidden layers, disentangling the complicated nonlinear relationships between image representations and shapes of facial landmarks. The linear layer, with an identity activation function, explicitly and efficiently encodes the intrinsic correlations between landmark points by low-rank learning via the matrix elastic net (MEN). The parameters in both the nonlinear and linear layers can be jointly optimized via a newly derived alternating optimization algorithm based on mini-batch gradient descent.
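To make the two-layer design concrete, the forward pass described above can be sketched as a cosine-activated random-feature layer followed by a rank-constrained linear output layer whose weight matrix is factored into two thin matrices. The following numpy sketch is illustrative only (the dimensions, rank `k`, and random initialization are assumptions; in the actual model the parameters are learned jointly, and the low rank is induced by the matrix elastic net rather than hard factorization):

```python
import numpy as np

rng = np.random.default_rng(0)

d, h, n_landmarks = 128, 256, 68    # feature dim, hidden units, landmarks (assumed sizes)
k = 10                              # rank budget of the low-rank linear layer (assumed)

# Nonlinear layer: linear projection followed by a cosine activation.
W = rng.normal(size=(h, d))
b = rng.uniform(0, 2 * np.pi, size=h)

# Linear low-rank layer: the output weight matrix is factored as U @ V,
# so its rank is at most k, coupling the 2*n_landmarks output coordinates.
U = rng.normal(size=(h, k)) * 0.01
V = rng.normal(size=(k, 2 * n_landmarks)) * 0.01

def hornet_forward(x):
    """Map an image feature vector to a flat (x1, y1, ..., xL, yL) shape vector."""
    z = np.cos(W @ x + b)           # nonlinear layer (kernel-approximation view)
    return (z @ U) @ V              # linear layer with identity activation

x = rng.normal(size=d)
shape = hornet_forward(x)
print(shape.shape)                  # (136,) -> 68 (x, y) landmark coordinates
```

Note that prediction is a few feed-forward matrix products, with no shape initialization or iterative refinement, which is the key structural difference from cascaded regression.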
Moreover, HORNet is highly flexible and can work either with pre-built feature representations or with convolutional architectures for end-to-end deep learning. HORNet inherits the strengths of kernel methods in tackling nonlinearity in data and enjoys the innate properties of neural networks in structural prediction. Together, these properties bridge the gap between kernel methods and neural networks for multivariate regression.
The contributions of this work can be summarized as follows:
- We propose a novel learning architecture, the heterogeneous output regression network (HORNet), for face alignment. HORNet allows for the direct prediction of facial landmarks from images, without relying on iterative cascaded regression models or shape initialization.
- HORNet can jointly handle the highly nonlinear input-output relationships and the underlying intrinsic correlations between outputs in a single framework.
- HORNet is highly flexible and can work either with pre-built feature representations or with convolutional architectures for end-to-end learning. It scales well with large datasets and is able to leverage large annotated unconstrained face alignment datasets.
HORNet has been extensively evaluated on five challenging in-the-wild datasets, including CelebA [25], MAFL [26], AFLW [27], 300-W [28] and 300-VW [29]. Experimental results show that HORNet achieves state-of-the-art performance and surpasses previous methods, demonstrating its great effectiveness for direct face alignment.
This paper focuses on the problem of explainable deep learning for efficient and robust pattern recognition, jointly aiming to improve explainability, efficiency, and robustness. Our work sheds light on the interpretation of the learning behaviour of deep neural networks, a problem that has been the subject of recent state-of-the-art work on explaining deep learning models [30], [31], [32], [33].
On the one hand, our model offers a clear explanation of the strong learning ability of the proposed deep learning architecture for heterogeneous output regression. The proposed HORNet contains a nonlinear layer that enables the model to disentangle the complex relationship between input images and the associated facial shapes represented as a set of landmarks. Images, in the form of pixels, generally contain low-level information, such as edges and contours, while facial shapes lie at a higher semantic level [34]. The proposed nonlinear layer is derived from nonlinear kernels by the technique of kernel approximation based on random Fourier features (RFFs). The RFFs lift the feature representations extracted from a convolutional architecture to a higher semantic level in the reproducing kernel Hilbert space (RKHS) [31]. This is also related to the recent work in [35], which explores nonlinear convolutional filters as vectors in an RKHS. Moreover, the cosine activation function in the derived nonlinear layer admits a natural explanation and provides a useful alternative to the ReLU activation function, which plays an important role in the learning behaviour of deep neural networks [36]. The proposed HORNet combines the strengths of both deep learning and kernels, in a similar spirit to recent work on learning deep kernels [37].
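As a concrete illustration of the kernel-approximation view, the classic random Fourier feature construction maps inputs through a random linear projection followed by a cosine, so that inner products of the resulting features approximate a Gaussian (RBF) kernel. Below is a minimal numpy sketch of this standard construction (not the paper's exact layer; the dimensions and kernel width are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, sigma = 16, 4000, 1.0         # input dim, number of random features, kernel width

# Sample frequencies from the Fourier transform of the Gaussian kernel,
# and random phases uniformly on [0, 2*pi).
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def rff(x):
    """Random Fourier feature map; inner products approximate the RBF kernel."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = rff(x) @ rff(y)
print(abs(exact - approx))          # small: error shrinks as D grows
```

The cosine nonlinearity in this construction is what motivates the cosine activation function of the nonlinear layer.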
On the other hand, the linear layer with a nuclear norm constraint establishes a matrix elastic net and is relevant to the recent work in [30], in which deep learning models are explained via a Bayesian nonparametric regression mixture model with multiple elastic nets. With our matrix elastic net, HORNet is able to better capture the interdependency among landmarks, which can also be well explained from the perspective of subspace learning. By imposing a low-rank constraint, the linear layer forces correlated outputs, e.g., landmarks that are spatially connected and interdependent, to share roughly the same subspace of feature representations. Moreover, the low-rank constraint also forces the network to extract the most important features [33], [38] that are closely related to facial landmarks, which enables more robust predictions [32]. Additionally, the linear layer with a low-rank constraint also enhances the expressive ability of the network by increasing its depth [39], in that the linear layer can be viewed as a network with one hidden layer and an identity activation, while the increased depth does not introduce any bad local minima [40].
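For intuition on how a matrix elastic net induces low rank: the nuclear-norm part of such a penalty is typically handled by singular value thresholding, while the Frobenius part shrinks the spectrum. The sketch below shows the closed-form proximal step for a penalty of the generic form tau*||M||_* + (gamma/2)*||M||_F^2; the matrix sizes and penalty weights are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def prox_matrix_elastic_net(M, tau, gamma):
    """Proximal operator of tau*||M||_* + (gamma/2)*||M||_F^2:
    soft-threshold the singular values by tau, then scale by 1/(1+gamma)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s = np.maximum(s - tau, 0.0) / (1.0 + gamma)    # singular value thresholding + shrinkage
    return (U * s) @ Vt

rng = np.random.default_rng(2)
M = rng.normal(size=(40, 136))                      # e.g. hidden units x 2*68 coordinates
M_low = prox_matrix_elastic_net(M, tau=8.0, gamma=0.1)
print(np.linalg.matrix_rank(M_low))                 # < 40: small singular values are zeroed
```

Zeroing small singular values is exactly what confines the output coordinates to a shared low-dimensional subspace, which is the subspace-learning interpretation given above.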
In addition, HORNet is computationally more efficient than previous models based on cascaded regression for face alignment. Thanks to the designed architecture, HORNet directly predicts the facial landmarks without relying on progressive cascaded regression. In particular, at test time, given an image, we obtain the landmarks through a few steps of feed-forward matrix computation rather than the iterative optimization of cascaded models. In general, in contrast to regular deep learning architectures, the proposed HORNet offers useful interpretability for deep learning models, which is also well supported by our experiments.
Related work
We review representative work to briefly show the historical progress of face alignment over the last decade. It is also worth mentioning that the rapid development in face alignment is largely attributed to the release of large annotated datasets of facial landmarks [41].
In earlier work, face alignment was addressed mainly with active shape/appearance models (ASM/AAM) [42]. Despite many improved variants of ASM/AAM [43], those models suffer from poor generalization to unseen images [44].
Direct face alignment
We introduce a novel heterogeneous output regression network (HORNet) for face alignment. HORNet learns a mapping function from the input space of image representations to the multivariate output space of facial landmarks. Taking the feature representation of the facial image as input, HORNet directly predicts a shape composed of facial landmarks as the multivariate output.
Experiments and results
We conduct extensive experiments to evaluate HORNet for direct face alignment on five challenging unconstrained datasets in the wild. These datasets cover the diversity of challenges confronting existing face alignment tasks, from single images to video sequences, and thus provide a comprehensive evaluation of face alignment approaches. HORNet consistently achieves state-of-the-art performance, showing its great effectiveness and generality for face alignment.
Conclusion and future work
In this paper, we have presented the heterogeneous output regression network (HORNet) for face alignment, which does not rely on cascaded regression. HORNet represents a new compact multivariate learning architecture comprised of a nonlinear layer and a low-rank linear layer. In particular, (i) HORNet combines the strengths of neural networks for structured prediction and kernels for nonlinear learning, offering a new powerful multi-output regressor; (ii) HORNet is independent of input features
References (86)
- Robust, discriminative and comprehensive dictionary learning for face recognition, Pattern Recognit. (2018)
- A complete and fully automated face verification system on mobile devices, Pattern Recognit. (2013)
- L2,1-based regression and prediction accumulation across views for robust facial landmark detection, Image Vis. Comput. (2016)
- A joint cascaded framework for simultaneous eye detection and eye state estimation, Pattern Recognit. (2017)
- Hierarchical facial landmark localization via cascaded random binary patterns, Pattern Recognit. (2015)
- 300 Faces in-the-wild challenge: database and results, Image Vis. Comput. (2016)
- Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment, ECCV (2014)
- Learning binary code for fast nearest subspace search, Pattern Recognit. (2020)
- Convolutional aggregation of local evidence for large pose face alignment, BMVC (2016)
(2016) - W.A.P. Smith, E.R. Hancock, Facial shape-from-shading and recognition using principal geodesic analysis and robust...
- A coupled statistical model for face shape recovery from brightness images, IEEE Trans. Image Process.
- Robust face landmark estimation under occlusion, ICCV
- Face detection, pose estimation, and landmark localization in the wild, CVPR
- Cascaded pose regression, CVPR
- Face alignment by explicit shape regression, Int. J. Comput. Vis.
- Supervised descent method and its applications to face alignment, CVPR
- Project-out cascaded regression with an application to face alignment, CVPR
- Face alignment by explicit shape regression, CVPR
- Face alignment by coarse-to-fine shape searching, CVPR
- Global supervised descent method, CVPR
- Unconstrained face alignment via cascaded compositional learning, CVPR
- Robust face alignment using a mixture of invariant experts, ECCV
- Correlation filter cascade for facial landmark localization, WACV
- Direct shape regression networks for end-to-end face alignment, CVPR
- Attentional alignment networks, BMVC
- Multi-scale aggregation network for direct face alignment, WACV
- Deep learning face attributes in the wild, ICCV
- Learning deep representation for face alignment with auxiliary attributes, IEEE Trans. Pattern Anal. Mach. Intell.
- Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization, ICCVW
- 300 faces in-the-wild challenge: the first facial landmark localization challenge, ICCVW
- The first facial landmark tracking in-the-wild challenge: benchmark and results, ICCVW
- Explaining deep learning models - a Bayesian non-parametric approach, NIPS
- Visualizing and understanding convolutional networks, ECCV
- A unified approach to interpreting model predictions, NIPS
- CXplain: causal explanations for model interpretation under uncertainty, NIPS
- Growing regression tree forests by classification for continuous object pose estimation, Int. J. Comput. Vis.
- Convexified convolutional neural networks, ICML
- On the impact of the activation function on deep neural networks training, ICML
- Deep kernel learning, AISTATS
- Learning important features through propagating activation differences, ICML
- Benefits of depth in neural networks, COLT
- Deep learning without poor local minima, NIPS
- Discriminative face alignment, IEEE Trans. Pattern Anal. Mach. Intell.
Xiantong Zhen received the B.S. and M.E. degrees from Lanzhou University, Lanzhou, China, in 2007 and 2010, respectively, and the Ph.D. degree from the Department of Electronic and Electrical Engineering, the University of Sheffield, UK, in 2013. He worked as a postdoctoral fellow with the University of Western Ontario, London, Canada and the University of Texas at Arlington, Texas, U.S.A. from 2013 to 2017. He was an associate professor with the School of Electronic and Information Engineering, Beihang University, Beijing, China from 2017 to 2018. He is currently with Inception Institute of Artificial Intelligence, United Arab Emirates and Guangdong University of Petrochemical Technology, Guangdong, China. His research interests include machine learning and computer vision.
Mengyang Yu received the B.S. and M.S. degrees from the School of Mathematical Sciences, Peking University, Beijing, China, in 2010 and 2013, respectively, and the Ph.D. degree from the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne, U.K., in 2017. He was a postdoctoral researcher at the Computer Vision Laboratory, ETH Zurich, Switzerland from 2017 to 2018. He is now a research scientist with Inception Institute of Artificial Intelligence. His research interests include computer vision, machine learning, and information retrieval.
Zehao Xiao received his B.S. and M.E. degrees from School of Electronic and Information Engineering, Beihang University, 2017. He is currently a PhD candidate with University of Amsterdam, The Netherlands. His research interests include Computer Vision, Medical Image Analysis and Machine Learning.
Lei Zhang received her PhD degree in computer science from Harbin Institute of Technology (HIT), Harbin, Heilongjiang, China, in 2004. She is currently a professor of Computer Science College, Guangdong University of Petrochemical Technology, Guangdong, China. Her research interests include Signal/Image Processing, Computer Vision and Machine Learning.
Ling Shao is CEO and Chief Scientist at Inception Institute of Artificial Intelligence, United Arab Emirates, and a professor with the School of Computing Sciences at the University of East Anglia, Norwich, UK. Previously, he was a professor (2014–2016) with Northumbria University, a senior lecturer (2009–2014) with the University of Sheffield, and a senior scientist (2005–2009) with Philips Research, The Netherlands. His research interests include computer vision, image/video processing and machine learning. He is an associate editor of IEEE Transactions on Image Processing, IEEE Transactions on Neural Networks and Learning Systems and several other journals. He is a Fellow of the British Computer Society and the Institution of Engineering and Technology.