Pattern Recognition

Volume 121, January 2022, 108210

Head pose estimation using deep neural networks and 3D point clouds

https://doi.org/10.1016/j.patcog.2021.108210

Highlights

  • We propose head pose estimation using deep neural networks and 3D point cloud.

  • We adopt 3D point cloud data generated from depth to estimate 3D head pose.

  • This is the first work on 3D point cloud based head pose estimation in a deep learning framework.

Abstract

In this paper, we propose head pose estimation using deep neural networks and 3D point clouds. Unlike existing methods that take either a 2D RGB image or a 2D depth image as input, we adopt 3D point cloud data generated from depth to estimate 3D head poses. To further improve the robustness and accuracy of head pose estimation, we classify the 3D angles of head poses into 36 classes at 5° intervals and predict the probability of each angle in a class with a multi-layer perceptron (MLP). While traditional iterative methods for head model construction require high computation and memory costs, the proposed method is lightweight and computationally efficient because it uses a sampled 3D point cloud as input combined with a graph convolutional neural network (GCNN). Experimental results on the Biwi Kinect Head Pose dataset show that the proposed method achieves outstanding performance in head pose estimation and outperforms state-of-the-art methods in terms of accuracy.

Introduction

Head pose estimation is used in various vision applications such as human-computer interaction and virtual reality. Moreover, it is a necessary preprocessing step for identification, gaze estimation, facial expression recognition and 3D face reconstruction [1], [2]. Head pose estimation aims to estimate the 3D Euler angles (roll, yaw and pitch) of head poses. Most existing methods [1], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14] take a 2D RGB image as input and learn a mapping between the 2D and 3D spaces. Traditional methods are feasible for regressing and classifying facial landmarks, but it is difficult to accurately estimate the 3D head pose in a complex environment. In recent years, researchers [1], [7], [8], [9], [10], [11], [12] have used the powerful learning ability of convolutional neural networks (CNNs) to extract features for head pose estimation. However, a single RGB image contains no 3D information, so the mapping from 2D space to 3D space causes estimation errors. Mid-range and short-range depth cameras make it very convenient to utilize the depth information in depth images. Model-based methods use depth information for head pose estimation by registering a 3D model via rigid iterative closest points (ICP) [15], [16] or particle swarm optimization (PSO) [17]. Although directly feeding a depth image into a CNN seems to work well for head pose estimation [18], it is difficult for a 2D CNN to extract features from a 2D image that fully capture the 3D spatial information of the depth image.
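
For concreteness, the three Euler angles determine a 3×3 head rotation matrix. The minimal sketch below composes one with an intrinsic Z-Y-X convention; the exact axis order and signs are assumptions, since conventions vary across datasets and papers.

```python
# Minimal sketch: composing a head rotation matrix from Euler angles.
# The Z-Y-X (roll, yaw, pitch) composition used here is an assumption;
# datasets and papers may adopt a different axis convention.
import numpy as np

def euler_to_rotation(pitch, yaw, roll):
    """Angles in radians; returns a 3x3 rotation matrix."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch about x
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw about y
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll about z
    return Rz @ Ry @ Rx
```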

To tackle the lack of 3D spatial information, we introduce a new data format, the 3D point cloud, for head pose estimation. A point cloud is a set of 3D points on the visible surface, generated from a single depth image obtained by a depth camera. The point cloud converted from the depth image is initially ordered, but after removing irrelevant parts and sampling it to a fixed number of points, the input point cloud becomes unordered. Typical convolution architectures such as CNNs do not work well on this kind of unordered data: if we shuffle the points before feeding them into a traditional CNN, the CNN regards the shuffled point cloud as a different input even though it is the same as the original one. PointNet++ [19] performs a grouping operation that gathers the nearest points into a group as the convolution input. The point cloud is a data format well suited to building a graph; if we use graph convolution, we do not need to shape the data for traditional convolution by stacking points in a group. The depth image also contains 2.5D information, with the x and y coordinates represented implicitly. The point cloud turns the implicit x and y coordinates into an explicit data format and removes the background outside the face and the zero-value pixels inside the face that may mislead head pose estimation, thus improving performance. Fig. 1 shows some head point clouds with their RGB and depth images.
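
To make this conversion concrete, the following minimal sketch back-projects a depth image into a 3D point cloud with the pinhole camera model; the intrinsics fx, fy, cx, cy are placeholders for the depth camera calibration, and the valid-depth masking mirrors the removal of zero-value pixels described above.

```python
# Minimal sketch: back-projecting a depth image into a 3D point cloud
# via the pinhole model. The intrinsics (fx, fy, cx, cy) are placeholders;
# real values come from the depth camera calibration.
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """depth: (H, W) array in meters; returns (N, 3) points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    valid = z > 0                      # drop zero-value (missing) pixels
    x = (u - cx) * z / fx              # implicit x made explicit
    y = (v - cy) * z / fy              # implicit y made explicit
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)
```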

In this paper, we propose head pose estimation from a single depth image using deep neural networks and 3D point clouds. Specifically, we first segment the input image using the head mask provided by the dataset and then convert the depth image of the head into a 3D point cloud. Before feeding the 3D points into the network, we downsample and normalize them to a normal distribution for robust estimation. Next, we apply the abstraction layer of PointNet++ to extract features from the input 3D points. Based on the extracted features, we classify the head pose angles into 36 classes at 5° intervals and predict the probability of each angle in a class with a multi-layer perceptron (MLP). Finally, we use the classification loss in the following regression layer to estimate the final head pose angles.
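
A minimal sketch of this preprocessing step is given below; the point count of 1024 and the use of simple random sampling (rather than, e.g., the farthest point sampling used inside PointNet++) are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of the preprocessing described above: sample a fixed
# number of points, then normalize to zero mean and unit variance.
import numpy as np

def preprocess(points, n_points=1024, seed=0):
    """points: (M, 3) head point cloud; returns (n_points, 3)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), n_points, replace=len(points) < n_points)
    sampled = points[idx]
    centered = sampled - sampled.mean(axis=0)      # zero mean
    return centered / (centered.std() + 1e-8)      # unit variance
```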

Compared with existing methods, the main contributions of this paper are summarized as follows:

  • We propose a network architecture for head pose estimation that takes 3D point cloud from depth as input instead of RGB image or depth image. Thus, the proposed network captures 3D spatial information better than those based on RGB image or depth image.

  • We adopt graph convolution on the 3D point cloud, which avoids stacking points to generate sufficient input dimensions, resulting in low memory cost and a lightweight model.

  • We apply a regression network after pose classification to make head pose estimation robust. Unlike previous work, which takes the expectation over class probabilities, we learn the mapping with a regression layer (see the sketch below).
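
To illustrate the third contribution, the following sketch pairs a 36-bin classifier per Euler angle with a learned regression layer over the bin probabilities, instead of the fixed soft-argmax expectation; the feature dimension and layer widths are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch (PyTorch): classify each Euler angle into 36 bins
# of 5 degrees, then learn a regression layer on the bin probabilities
# rather than taking a fixed expectation. Layer sizes are assumptions.
import torch
import torch.nn as nn

class ClassifyThenRegress(nn.Module):
    def __init__(self, feat_dim=1024, n_bins=36):
        super().__init__()
        # one 36-way classifier per Euler angle (pitch, yaw, roll)
        self.cls = nn.Linear(feat_dim, 3 * n_bins)
        # learned regression over bin probabilities, not soft-argmax
        self.reg = nn.Linear(n_bins, 1)
        self.n_bins = n_bins

    def forward(self, feat):
        logits = self.cls(feat).view(-1, 3, self.n_bins)
        probs = torch.softmax(logits, dim=-1)
        angles = self.reg(probs).squeeze(-1)   # (B, 3) Euler angles
        return logits, angles
```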

Related works

There are several approaches to head pose estimation based on RGB images, depth images or both. Two main streams among these approaches are (1) regression/discriminative methods and (2) registration (model-based) methods.

Regression/discriminative approach: It directly infers the 3D head pose angles from a 2D image. Whitehill and Movellan [4] detected facial features such as the centers of both eyes, the tip of the nose and the center of the mouth, and localized the face frame-by-frame by feeding the facial landmarks into a classifier.

Proposed method

We first convert a depth image of the head into a point cloud of size N×(d+C), where N is the number of points, d is the number of point dimensions, and C is the number of additional point features such as normals. In this work, d=3, corresponding to the x, y, z coordinates, while C=0, i.e., we do not use any other point features. Then, we feed the normalized point cloud into PointNet++ [19] to extract features and obtain a classification loss. Finally, we perform regression with the classification loss to get the final head pose angles.
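
Since the method relies on graph convolution over the point cloud, a minimal sketch of the underlying k-nearest-neighbor graph construction is given below; the neighborhood size k is an assumption, and this is only the graph-building step, not the full GCNN.

```python
# Minimal sketch: building a k-nearest-neighbor graph over the point
# cloud as input to a graph convolution, avoiding the point stacking
# used to shape data for ordinary convolutions. k is an assumption.
import torch

def knn_graph(points, k=16):
    """points: (N, 3) tensor; returns (N, k) neighbor indices."""
    dist = torch.cdist(points, points)   # (N, N) pairwise distances
    # index 0 of each row is the point itself, so skip it
    return dist.topk(k + 1, largest=False).indices[:, 1:]
```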

Experimental results

To verify the performance of the proposed method, we perform experiments on the Biwi Kinect Head Pose dataset. We use a PC with an Intel i7-7700 CPU, 16 GB RAM and an Nvidia GTX 1080Ti GPU. The network architecture is implemented in the PyTorch 1.2 framework. Moreover, we conduct an ablation study on head pose estimation without our classification module to show the effect of the classification loss on performance. We provide some head pose estimation results of the proposed method in Fig. 6.
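
Head pose accuracy on this dataset is conventionally reported as the mean absolute error of the three Euler angles; a minimal sketch of that metric follows (array shapes and names are placeholders, not the paper's evaluation code).

```python
# Minimal sketch of the conventional evaluation metric: mean absolute
# error (in degrees) of each predicted Euler angle over the test set.
import numpy as np

def mean_absolute_error(pred, gt):
    """pred, gt: (N, 3) arrays of (pitch, yaw, roll) in degrees."""
    err = np.abs(pred - gt).mean(axis=0)
    return dict(zip(("pitch", "yaw", "roll"), err))
```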

Conclusion

In this paper, we have proposed a point cloud based deep neural network for head pose estimation. We have used a 3D point cloud generated from a depth image to effectively capture 3D information. Moreover, we have utilized graph convolution, which is suitable for extracting features from unordered 3D point cloud input. First, we have built a network architecture for head pose estimation that takes a 3D point cloud as input. Second, we have performed graph convolution on the 3D point cloud to avoid stacking points, resulting in low memory cost and a lightweight model.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 61872280).

References (38)

  • B. Ahn et al., Real-time head pose estimation using multi-task deep neural network, Robotics and Autonomous Systems (2018).
  • Y. Wang et al., A deep coarse-to-fine network for head pose estimation from synthetic data, Pattern Recognition (2019).
  • K. Chan et al., A 3-D-point-cloud system for human-pose estimation, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).
  • N. Ruiz et al., Fine-grained head pose estimation without keypoints, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2018).
  • G. Fanelli et al., Real time head pose estimation with random regression forests, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2011).
  • T. Vatahska et al., Feature-based head pose estimation from images, Proceedings of the IEEE-RAS International Conference on Humanoid Robots (2007).
  • J. Whitehill et al., A discriminative approach to frame-by-frame head pose tracking, Proceedings of the IEEE Conference on Automatic Face & Gesture Recognition (2008).
  • V. Drouard et al., Head pose estimation via probabilistic high-dimensional regression, Proceedings of the IEEE International Conference on Image Processing (2015).
  • V. Drouard et al., Switching linear inverse-regression model for tracking head pose, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (2017).
  • B. Ahn et al., Real-time head orientation from a monocular camera using deep neural network, Proceedings of the Asian Conference on Computer Vision (2014).
  • S. Lathuilière et al., Deep mixture of linear inverse regressions applied to head-pose estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017).
  • X. Xu et al., Joint head pose estimation and face alignment framework using global and local CNN features, Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (2017).
  • T.-Y. Yang et al., FSA-Net: learning fine-grained structure aggregation for head pose estimation from a single image, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019).
  • R. Yang et al., Model-based head pose tracking with stereovision, Proceedings of the IEEE Conference on Automatic Face & Gesture Recognition (2002).
  • C. Cao et al., 3D shape regression for real-time facial animation, ACM Transactions on Graphics (2013).
  • M. Storer et al., 3D-MAM: 3D morphable appearance model for efficient fine head pose estimation from still images, Proceedings of the IEEE Conference on Computer Vision Workshops (2009).
  • C. Luo et al., Real-time head pose estimation and face modeling from a depth image, IEEE Transactions on Multimedia (2019).
  • P. Padeleris et al., Head pose estimation on depth data based on particle swarm optimization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2012).
  • M. Venturelli et al., From depth data to head pose estimation: a Siamese approach, arXiv preprint arXiv:1703.03624 (2017).

    Yuanquan Xu received the B.S. degree in physical engineering from Zhengzhou University, China, in 2018. He is currently pursuing the M.S. degree in Xidian University, China. His research interests include computer vision, 3D reconstruction and machine learning.

    Cheolkon Jung is a Born Again Christian. He received the B.S., M.S., and Ph.D. degrees in electronic engineering from Sungkyunkwan University, South Korea, in 1995, 1997, and 2002, respectively. He was a Research Staff Member with Samsung Advanced Institute of Technology, Samsung Electronics, South Korea, from 2002 to 2007. He was also a Research Professor with the School of Information and Communication Engineering, Sungkyunkwan University, from 2007 to 2009. Since 2009, he has been with the School of Electronic Engineering, Xidian University, China, where he is currently a Full Professor and the Director of the Xidian Media Laboratory. His main research interests include image and video processing, computer vision, pattern recognition, machine learning, computational photography, video coding, virtual reality, information fusion, multimedia content analysis and management, and 3DTV.

    Yakun Chang received the B.S. degree in electronic engineering from Xidian University, China, in 2015. He is currently pursuing the Ph.D. degree in the same university. His research interests include computer vision, pattern recognition and machine learning.
