Learning from discrete Gaussian label distribution and spatial channel-aware residual attention for head pose estimation
Introduction
In the field of computer vision, face-based analysis and modeling have long been challenging problems, including face recognition and identification [1], [2], [3], 3D face reconstruction [4], [5], [6], [7], facial landmark detection [8], [9], [10], [11] and head pose estimation [12], [13], [14], [15]. Among them, head pose estimation has played an important role for many years in three-dimensional reconstruction [5], face alignment [8], multi-pose face recognition [16], [17] and other applications. In general, head pose estimation is the task of predicting the three Euler angles: yaw, pitch and roll. To address this problem, a large number of databases [18], [12], [9] containing 3D, 2D and even temporal face data, as well as various effective methods [19], [13], have been proposed. Since methods [20], [21], [12] that use data with depth or temporal information usually consume more computing resources and are not easy to deploy in practical applications, many studies [22], [15], [14], [23], [24] have focused on single RGB images. Some effective approaches are based on facial keypoints [8], [25]. These landmark-based methods typically establish a correspondence between the keypoints and a 3D head model, and predict the head pose in 3D space. However, they rely on accurate keypoint detection and an additional 3D model, which keeps them somewhat removed from widespread application.
With the large-scale application of deep convolutional neural networks, methods based on them have achieved good performance [23], [14], [15]. FAN [8] directly predicts the 3D coordinates of the keypoints through a conv-net and then derives the head pose. Hyperface [23] is a multi-objective learning method that includes direct regression of the three pose values. Hopenet [15] converts the regression into a classification problem and computes the mathematical expectation of the angles. FSA-Net [14] relies on soft stage-wise regression and feature aggregation.
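Hopenet's expectation step can be sketched as follows. This is a minimal illustration, not the authors' code; the bin layout (3° bins covering roughly [-99°, 99°]) and all variable names are our own assumptions:

```python
import numpy as np

# Hypothetical bin centers: 66 bins of 3 degrees over [-99, 99] degrees.
centers = np.arange(-99, 99, 3) + 1.5

def expected_angle(logits):
    """Softmax over the bin logits, then the expectation over bin centers."""
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return float((p * centers).sum())

# Toy logits peaked around 30 degrees, standing in for a network output.
logits = -np.abs(centers - 30.0) / 3.0
```

The expectation turns a discrete classification output back into a continuous angle, which is why a soft, distribution-shaped target can be a natural fit for this head.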
However, we have observed that head pose angles are arbitrary and continuous in practice, so the one-hot labels produced by bin classification may lose information. As shown in Fig. 1(a), the samples have similar poses and the maximum discrepancy among their ground-truth yaw values is less than . The first two samples fall into the same yaw bin and the last two into another, where the bin step is as in Hopenet [15]. The last column then shows that pose prediction based on binned classification is not robust: errors occur when the pose angle lies near the preset bin edges.
Another problem is demonstrated in Fig. 1(b): the ROI (region of interest) of the original face image greatly affects accuracy, which means pose estimation must depend on accurate face detection results and is therefore inefficient in practice.
In order to reduce the impact of the two problems mentioned above, and inspired by the work of Nataniel et al. [15], we have made the following attempts: 1) seeking a new label distribution to represent the head pose instead of the one-hot vector used for binned classification; 2) establishing an attention structure to reduce the impact of varying face-image ROIs during pose estimation; 3) deploying an end-to-end DCNN to implement the above structure. As a result, our proposed method is, to the best of our knowledge, the first to use both label distribution and attention learning for pose estimation, and it makes the following contributions:
- 1.
We propose a discrete Gaussian distribution label (DGDL) to depict the head pose angles with a combination of sampling spots. Unlike the one-hot vector, our proposed label distribution guarantees the uniqueness of arbitrary poses. Besides, compared with exact scalar angles, the DGDL reflects the continuity of the head pose. Furthermore, the DGDL can be transferred to similar tasks such as age estimation.
- 2.
We propose a spatial channel-aware residual attention (SCR-AT) structure that combines two attention modules, which focus respectively on precisely identifying facial regions and on activating specific channels of the feature map. The two attention maps do not affect each other and are connected to the shortcut feature map, giving our proposed SCR-AT a better representation of head pose and faster training convergence.
- 3.
Following existing works, we trained our models on publicly available datasets such as 300W-LP [18]. Extensive comparative experiments show that our method achieves state-of-the-art results on benchmarks such as AFLW2000 [18], BIWI [12], etc.
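As a rough illustration of the first contribution, a discrete Gaussian label over a set of sampling spots might be constructed as follows. This is a sketch under our own assumptions: the spot spacing and the width parameter `sigma` are hypothetical hyperparameters, not the paper's exact values:

```python
import numpy as np

def gaussian_label(angle, spots, sigma=3.0):
    """Discretize a Gaussian centered at the true angle over the sampling spots."""
    p = np.exp(-(spots - angle) ** 2 / (2.0 * sigma ** 2))
    return p / p.sum()                        # normalize to a probability vector

spots = np.arange(-99, 100, 3, dtype=float)   # hypothetical sampling spots
label = gaussian_label(20.0, spots)
```

Unlike a one-hot bin label, two nearby angles (e.g. 20.0° and 20.9°) receive distinct soft labels, and taking the expectation over the spots recovers a continuous angle close to the ground truth.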
Related works
Human head pose estimation has been widely studied throughout the history of computer vision, with many diverse approaches proposed. Appearance template methods [26], [27] use image-based comparison metrics to compare new images with a set of exemplars. Detector array methods [28], [29] train multiple face detectors, one for each discrete pose. Nonlinear regression methods then map from image space to pose space using techniques such as SVR [30] and PCA [31].
In recent years, facial
Methodology
In this section, we first describe our discrete Gaussian distribution label (DGDL). Then, we introduce the spatial channel-aware residual attention (SCR-AT). Finally, we give an overview of the end-to-end workflow and some details of the whole architecture.
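Before the detailed subsections, the residual combination of the two attention maps can be sketched roughly as follows. This is an illustrative NumPy mock-up with random weights, not the paper's exact architecture; `w_c` and `w_s` merely stand in for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scr_at(x, w_c, w_s):
    """Gate the shortcut feature map with independent channel and spatial attention.

    x   : (C, H, W) feature map
    w_c : (C, C) weights of a hypothetical channel-attention layer
    w_s : (C,)   weights of a hypothetical 1x1 spatial-attention conv
    """
    ca = sigmoid(w_c @ x.mean(axis=(1, 2)))     # (C,)   channel attention map
    sa = sigmoid(np.tensordot(w_s, x, axes=1))  # (H, W) spatial attention map
    # the two maps are computed independently and jointly gate the residual branch
    return x + x * ca[:, None, None] * sa[None, :, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w_c = 0.1 * rng.standard_normal((8, 8))
w_s = 0.1 * rng.standard_normal(8)
y = scr_at(x, w_c, w_s)
```

Because both attention maps lie in (0, 1), the residual correction is bounded by the input activations, which is one plausible reason such a structure can converge quickly.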
Experiment and results
In this section we describe the implementation, validate our approach on challenging datasets, present ablation studies and compare with state-of-the-art methods.
Conclusion
In this paper, we propose a new method to directly predict the head pose from a single RGB image, and our approach is more accurate and robust. By defining a probabilistic label computed from a discrete Gaussian distribution together with a spatial channel-aware residual attention structure, we have evaluated our approach on challenging head pose datasets and achieved state-of-the-art results with small error. Furthermore, we have performed ablation studies and results have shown that our approach
CRediT authorship contribution statement
Yi Zhang: Conceptualization, Methodology, Software. Keren Fu: Writing - original draft. Jiang Wang: Validation. Peng Cheng: Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by the National Science Foundation of China, Nos. U1833128, 61703077, the Fundamental Research Funds for the Central Universities, No. YJ201755, the Sichuan Science and Technology Major Projects (No. 2018GZDZX0029), and the National Key Research and Development Program of China (No. 2016YFC0801100).
Yi Zhang received the B.E. degree from Tong Ji University, Shanghai, China. He is currently pursuing a Ph.D. degree in computer science at National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China, under the supervision of Prof. Zhisheng You. His current research interests include visual computing, saliency analysis, and machine learning.
References (49)
- Efficient 3D reconstruction for face recognition, Pattern Recogn. (2005)
- Reconstruction and analysis of multi-pose face images based on nonlinear dimensionality reduction, Pattern Recogn. (2004)
- Composite support vector machines for detection of faces across views and pose estimation, Image Vis. Comput. (2002)
- Face description with local binary patterns: application to face recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2006)
- Reconstruction-based disentanglement for pose-invariant face recognition
- J. Yang, P. Ren, D. Zhang, D. Chen, F. Wen, H. Li, G. Hua, Neural aggregation network for video face recognition, in:...
- Large pose 3D face reconstruction from a single image via direct volumetric CNN regression
- 3D face reconstruction from a single image using a single reference face shape, IEEE Trans. Pattern Anal. Mach. Intell. (2010)
- Learning detailed face reconstruction from a single image
- How far are we from solving the 2D & 3D face alignment problem? (And a dataset of 230,000 3D facial landmarks)
- Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization
- 300 faces in-the-wild challenge: the first facial landmark localization challenge
- Random forests for real time 3D face analysis, Int. J. Comput. Vision
- Head pose estimation in computer vision: a survey, IEEE Trans. Pattern Anal. Mach. Intell.
- FSA-Net: learning fine-grained structure aggregation for head pose estimation from a single image
- Fine-grained head pose estimation without keypoints
- Multi-pose face recognition for person retrieval in camera networks
- Face alignment across large poses: a 3D solution
- A comprehensive performance evaluation of deformable face tracking "in-the-wild", Int. J. Comput. Vision
- Deep head pose: gaze-direction estimation in multimodal video, IEEE Trans. Multimedia
- FacePoseNet: making a case for landmark-free face alignment
- HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition, IEEE Trans. Pattern Anal. Mach. Intell.
Keren Fu received the dual Ph.D. degrees from Shanghai Jiao Tong University, Shanghai, China, and Chalmers University of Technology, Gothenburg, Sweden, under the joint supervision of Prof. Jie Yang and Prof. Irene Yu-Hua Gu. He is currently a research associate professor with College of Computer Science, Sichuan University, Chengdu, China. His current research interests include visual computing, saliency analysis, and machine learning.
Jiang Wang received the M.E. degree in Information and Communication Engineering from University of Electronics Science and Technology of China in 2014. He is currently pursuing a Ph.D. degree in computer science at National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China. His research interests include face recognition and face image analysis.
Peng Cheng received Ph.D. degree from Sichuan University, Chengdu, China. Currently, he is an associate professor with College of Aeronautics and Astronautics in Sichuan University, Chengdu, China. His research interests include image registration, image fusion and computer vision.