Elsevier

Neurocomputing

Volume 407, 24 September 2020, Pages 259-269

Learning from discrete Gaussian label distribution and spatial channel-aware residual attention for head pose estimation

https://doi.org/10.1016/j.neucom.2020.05.010

Abstract

Recent head pose estimation techniques have advanced by performing bin classification, where the predicted result is compared against a one-hot classification vector. We argue that head poses may be better modelled by a discrete distribution sampled from a smooth continuous curve rather than by one-hot coding or other kinds of binned classification vectors, since pose angles in practice are arbitrary. In this paper, we propose a deep head pose estimation scheme that regresses the predicted probabilistic labels against a discrete Gaussian distribution. This Gaussian distribution models the arbitrary state of true head poses and supervises the deep network through a maximum mean discrepancy loss. In addition, we propose a spatial channel-aware residual attention structure that enhances intrinsic pose features to further improve prediction accuracy and speed up training convergence. Experiments on two public datasets, AFLW2000 and BIWI, show that the proposed method outperforms all previous methods and that its individual components yield substantial improvements.

Introduction

In the field of computer vision, face-based analysis and modeling problems have long been challenging, including face recognition [1], [2], [3] and identification, 3D face reconstruction [4], [5], [6], [7], facial landmark detection [8], [9], [10], [11] and head pose estimation [12], [13], [14], [15]. Among them, head pose estimation has played an important role for many years in three-dimensional reconstruction [5], face alignment [8], multi-pose face recognition [16], [17] and other applications. In general, head pose estimation is the task of predicting the three Euler angles: yaw, pitch and roll. To solve this problem, a large number of databases [18], [12], [9] containing 3D, 2D and even temporal face data, as well as various effective methods [19], [13], have been proposed. Since methods [20], [21], [12] that use depth or temporal information usually consume more computing resources and are harder to deploy in practical applications, much research [22], [15], [14], [23], [24] has focused on single RGB images. Some effective approaches are based on facial keypoints [8], [25]. These landmark-based methods establish a correspondence between the keypoints and a 3D head model and predict head poses in 3D space. However, they rely on accurate keypoint detection and an additional 3D model, which keeps them somewhat removed from widespread application.

With the large-scale adoption of deep convolutional neural networks, methods based on them have achieved good performance [23], [14], [15]. FAN [8] directly predicts the 3D coordinates of keypoints through a conv-net and further derives the head pose. Hyperface [23] is a multi-objective learning method that includes direct regression of the three pose values. Hopenet [15] converts the regression into a classification problem and computes the mathematical expectation of the angles. FSA-Net [14] relies on soft stage-wise regression and feature aggregation.
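
The classification-then-expectation idea used by Hopenet can be sketched as follows. This is an illustrative reconstruction only: the 66 bins of 3° covering [-99°, 99°) match the commonly cited setup, but the exact bin centering and softmax details of the original implementation may differ.

```python
import numpy as np

def expected_angle(logits, range_start=-99.0, bin_step=3.0):
    """Soft-argmax over pose bins: softmax the class logits, then take
    the probability-weighted mean of the bin centers (in degrees)."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    centers = range_start + bin_step * (np.arange(len(logits)) + 0.5)
    return float(np.dot(probs, centers))

# A distribution sharply peaked at bin 33 yields an angle near that
# bin's center (1.5 degrees under the assumed grid).
logits = np.zeros(66)
logits[33] = 10.0
print(round(expected_angle(logits), 1))
```

The expectation step is what lets a classification head output a continuous angle instead of a hard bin label.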

However, we have observed that head pose angles are arbitrary and continuous in practice, so the one-hot labels used in bin classification may discard information. As shown in Fig. 1(a), the samples have similar poses and the maximum discrepancy of their ground-truth yaw is less than 6°. The first two samples share one binned yaw value and the last two share another, where the bin step is 3° as in Hopenet [15]. The last column shows that prediction based on binned classification is not robust: errors occur when the pose angles lie near the preset bin edges.
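
The bin-edge sensitivity described above can be seen numerically. Using hypothetical yaw values and the 3° bin step mentioned in the text:

```python
# With a 3-degree bin step, two yaw angles only 0.2 degrees apart can
# receive entirely different one-hot class labels when they straddle a
# bin edge, even though the underlying poses are nearly identical.
def bin_index(angle_deg, bin_step=3.0, range_start=-99.0):
    return int((angle_deg - range_start) // bin_step)

print(bin_index(2.9))   # bin 33
print(bin_index(3.1))   # bin 34
```

A smooth label distribution avoids this discontinuity, since nearby angles produce nearby labels.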

Another problem, demonstrated in Fig. 1(b), is that the ROI (region of interest) of the original face image greatly impacts accuracy, which means pose estimation depends on accurate face detection results and is therefore inefficient in practice.

To reduce the impact of the two problems mentioned above, and inspired by the work of Ruiz et al. [15], we make the following attempts: 1) seeking a new label distribution to represent the head pose instead of the one-hot vector for binned classification; 2) establishing an attention structure to reduce the impact of varying face ROIs during pose estimation; 3) deploying an end-to-end DCNN that implements the above structure. To the best of our knowledge, our proposed method is the first to use both label distribution and attention learning for pose estimation, and it makes the following contributions:

  • 1.

    We propose a discrete Gaussian distribution label (DGDL) that depicts head pose angles as a combination of sampling spots. Unlike the one-hot vector, our proposed label distribution guarantees the uniqueness of arbitrary poses. Moreover, compared with exact scalar angles, the continuity of the head pose is reflected by our DGDL. Furthermore, the DGDL can be migrated to similar tasks such as age estimation.

  • 2.

    We propose a spatial channel-aware residual attention (SCR-AT) structure, which combines two attention modules that respectively focus on identifying face regions precisely and on activating specific channels of the feature map. These two attention maps do not affect each other and are connected to the shortcut feature map, allowing our proposed SCR-AT to achieve a better representation of head pose and faster training convergence.

  • 3.

    Following existing works, we train our models on publicly available datasets such as 300W-LP [18]. Various comparative experiments show that our method achieves state-of-the-art results on benchmarks such as AFLW2000 [18] and BIWI [12].
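
The DGDL of the first contribution can be sketched as follows. This is a minimal illustration only: the sampling grid (66 bin centers over [-99°, 99°]) and the standard deviation are assumptions for demonstration, not the paper's reported settings.

```python
import numpy as np

def dgdl(angle_deg, sigma=3.0, bin_step=3.0, lo=-99.0, hi=99.0):
    """Discrete Gaussian distribution label: sample a Gaussian centered
    at the true angle on a fixed grid of bin centers, then normalize so
    the samples form a valid probability distribution."""
    centers = np.arange(lo + bin_step / 2, hi, bin_step)
    weights = np.exp(-0.5 * ((centers - angle_deg) / sigma) ** 2)
    return weights / weights.sum()

label = dgdl(2.9)
# Unlike a one-hot vector, nearby angles yield smoothly varying labels,
# and every real-valued angle maps to a unique distribution.
print(np.allclose(label.sum(), 1.0))
```

Because the label is a proper distribution over the same support as the network's output, it can supervise a probabilistic prediction head directly.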


Related works

Human head pose estimation has been widely studied throughout the history of computer vision, with many diverse approaches proposed. Appearance template methods [26], [27] use image-based comparison metrics to compare new images with a set of exemplars. Detector array methods [28], [29] train multiple face detectors, one for each discrete pose. Nonlinear regression methods map from image space to pose space using techniques such as SVR [30] and PCA [31].

In recent years, facial

Methodology

In this section, we firstly describe our discrete Gaussian distribution label (DGDL). Then, we introduce the spatial channel-aware residual attention (SCR-AT). Finally we give an overview of the end-to-end workflow and some details of the whole architecture.
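
The abstract states that the DGDL supervises the network through a maximum mean discrepancy loss. The paper's kernel and bandwidth are not reproduced here; a generic Gaussian-kernel MMD between a predicted distribution and a DGDL target, both defined over the same angle grid, might look like:

```python
import numpy as np

def mmd2(p, q, centers, bandwidth=10.0):
    """Squared maximum mean discrepancy between two discrete
    distributions p and q sharing the support `centers`,
    under a Gaussian kernel (bandwidth is an illustrative choice)."""
    d = centers[:, None] - centers[None, :]
    K = np.exp(-0.5 * (d / bandwidth) ** 2)
    return float(p @ K @ p + q @ K @ q - 2 * p @ K @ q)

centers = np.arange(-97.5, 99, 3.0)           # 66 assumed bin centers
p = np.full(66, 1 / 66)                        # a uniform prediction
q = np.exp(-0.5 * ((centers - 2.9) / 3.0) ** 2)
q /= q.sum()                                   # DGDL-style target
print(mmd2(p, q, centers) > mmd2(q, q, centers))  # True: identical inputs give zero
```

MMD is zero exactly when the two distributions coincide, so minimizing it pulls the predicted distribution toward the Gaussian-shaped label rather than toward a single hard bin.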

Experiment and results

In this section, we describe the implementation, validate our approach on challenging datasets, present ablation studies and compare with state-of-the-art methods.
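
Comparisons on AFLW2000 and BIWI are conventionally reported as the mean absolute error of each Euler angle; the sketch below assumes that standard metric (the paper's exact evaluation protocol is not reproduced here), with hypothetical predictions and ground truths.

```python
import numpy as np

def pose_mae(pred, gt):
    """Mean absolute error per Euler angle (yaw, pitch, roll) in degrees.
    pred, gt: arrays of shape (N, 3)."""
    return np.mean(np.abs(pred - gt), axis=0)

pred = np.array([[10.0, -5.0, 2.0], [30.0, 4.0, -1.0]])
gt   = np.array([[12.0, -4.0, 2.0], [27.0, 5.0,  1.0]])
print(pose_mae(pred, gt))  # MAE of [2.5, 1.0, 1.0] degrees
```

A single average over the three per-angle errors is also often reported as an overall score.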

Conclusion

In this paper, we propose a new method to directly predict the head pose from a single RGB image; our approach shows higher accuracy and robustness. By defining a probabilistic label computed from a discrete Gaussian distribution together with a spatial channel-aware residual attention structure, we evaluate our approach on challenging head pose datasets and achieve state-of-the-art results with small error. Furthermore, we performed ablation studies, and results have shown that our approach

CRediT authorship contribution statement

Yi Zhang: Conceptualization, Methodology, Software. Keren Fu: Writing - original draft. Jiang Wang: Validation. Peng Cheng: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China, Nos. U1833128, 61703077, the Fundamental Research Funds for the Central Universities, No. YJ201755, the Sichuan Science and Technology Major Projects (No. 2018GZDZX0029), and the National Key Research and Development Program of China (No. 2016YFC0801100).

Yi Zhang received the B.E. degree from Tong Ji University, Shanghai, China. He is currently pursuing a Ph.D. degree in computer science at National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China, under the supervision of Prof. Zhisheng You. His current research interests include visual computing, saliency analysis, and machine learning.

References (49)

  • M. Koestinger et al., Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization
  • C. Sagonas et al., 300 faces in-the-wild challenge: the first facial landmark localization challenge
  • Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task learning, in: European Conference on...
  • G. Fanelli et al., Random forests for real time 3d face analysis, Int. J. Comput. Vision (2013)
  • E. Murphy-Chutorian et al., Head pose estimation in computer vision: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
  • T.-Y. Yang et al., Fsa-net: learning fine-grained structure aggregation for head pose estimation from a single image
  • N. Ruiz et al., Fine-grained head pose estimation without keypoints
  • M. Bauml et al., Multi-pose face recognition for person retrieval in camera networks
  • X. Zhu et al., Face alignment across large poses: a 3d solution
  • G.G. Chrysos et al., A comprehensive performance evaluation of deformable face tracking “in-the-wild”, Int. J. Comput. Vision (2018)
  • S.S. Mukherjee et al., Deep head pose: gaze-direction estimation in multimodal video, IEEE Trans. Multimedia (2015)
  • M. Martin, F. Van De Camp, R. Stiefelhagen, Real time head model creation and head pose estimation on consumer depth...
  • F.-J. Chang et al., Faceposenet: making a case for landmark-free face alignment
  • R. Ranjan et al., Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017)

    Keren Fu received the dual Ph.D. degrees from Shanghai Jiao Tong University, Shanghai, China, and Chalmers University of Technology, Gothenburg, Sweden, under the joint supervision of Prof. Jie Yang and Prof. Irene Yu-Hua Gu. He is currently a research associate professor with College of Computer Science, Sichuan University, Chengdu, China. His current research interests include visual computing, saliency analysis, and machine learning.

    Jiang Wang received the M.E. degree in Information and Communication Engineering from University of Electronics Science and Technology of China in 2014. He is currently pursuing a Ph.D. degree in computer science at National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China. His research interests include face recognition and face image analysis.

    Peng Cheng received Ph.D. degree from Sichuan University, Chengdu, China. Currently, he is an associate professor with College of Aeronautics and Astronautics in Sichuan University, Chengdu, China. His research interests include image registration, image fusion and computer vision.
