Pattern Recognition

Volume 121, January 2022, 108210

Head pose estimation using deep neural networks and 3D point clouds

https://doi.org/10.1016/j.patcog.2021.108210

Highlights

  • We propose head pose estimation using deep neural networks and 3D point cloud.

  • We adopt 3D point cloud data generated from depth to estimate 3D head pose.

  • This is the first work on 3D point cloud based head pose estimation in a deep learning framework.

Abstract

In this paper, we propose head pose estimation using deep neural networks and 3D point clouds. Unlike existing methods that take either a 2D RGB image or a 2D depth image as input, we adopt 3D point cloud data generated from depth to estimate 3D head poses. To further improve the robustness and accuracy of head pose estimation, we classify the 3D angles of head poses into 36 classes at 5° intervals and predict the probability of each angle in a class with a multi-layer perceptron (MLP). While traditional iterative methods for head model construction require high computation and memory costs, the proposed method is lightweight and computationally efficient because it uses a sampled 3D point cloud as input combined with a graph convolutional neural network (GCNN). Experimental results on the Biwi Kinect Head Pose dataset show that the proposed method achieves outstanding performance in head pose estimation and outperforms state-of-the-art methods in terms of accuracy.

Introduction

Head pose estimation is used in various vision applications such as human-computer interaction and virtual reality. Moreover, it is a necessary preprocessing step for identification, gaze estimation, facial expression recognition and 3D face reconstruction [1], [2]. Head pose estimation aims to estimate the 3D Euler angles (roll, yaw and pitch) of head poses. Most existing methods [1], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14] take a 2D RGB image as input and learn a mapping between the 2D and 3D spaces. Traditional methods are feasible for regressing and classifying facial landmarks, but it is difficult to accurately estimate the 3D head pose in a complex environment. In recent years, researchers [1], [7], [8], [9], [10], [11], [12] have used the powerful learning ability of convolutional neural networks (CNNs) to extract features for head pose estimation. However, a single RGB image contains no 3D information, so the mapping from 2D space to 3D space causes estimation errors. Mid-range and short-range depth cameras make it very convenient to utilize the depth information in depth images. Model-based methods use depth information for head pose estimation by registering a 3D model via rigid iterative closest points (ICP) [15], [16] or particle swarm optimization (PSO) [17]. Although directly feeding a depth image into a CNN seems to work well for head pose estimation [18], it is difficult for a 2D CNN to extract features from a 2D image that fully capture the 3D spatial information of the depth image.
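
For concreteness, the three Euler angles determine a 3×3 head rotation matrix. The minimal sketch below composes one with an intrinsic Z-Y-X convention; the exact axis order and signs are assumptions, since conventions vary across datasets and papers.

```python
# Minimal sketch: composing a head rotation matrix from Euler angles.
# The Z-Y-X (roll, yaw, pitch) composition used here is an assumption;
# datasets and papers may adopt a different axis convention.
import numpy as np

def euler_to_rotation(pitch, yaw, roll):
    """Angles in radians; returns a 3x3 rotation matrix."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])   # pitch about x
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])   # yaw about y
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])   # roll about z
    return Rz @ Ry @ Rx
```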

To tackle the lack of 3D spatial information, we introduce a new data format, the 3D point cloud, for head pose estimation. A point cloud is a set of 3D points on the visible surface, generated from a single depth image obtained by a depth camera. The point cloud converted from the depth image is initially ordered, but after removing irrelevant parts and sampling it to a fixed number of points, the input point cloud becomes unordered. Typical convolution architectures such as CNNs do not work well on this kind of unordered data: if we shuffle the points before feeding them into a traditional CNN, the CNN regards the shuffled point cloud as a different input even though it is the same as the original one. PointNet++ [19] performs a grouping operation that gathers the nearest points into a group as the convolution input. The point cloud is a data format well suited to building a graph; if we use graph convolution, we do not need to shape the data for traditional convolution by stacking points in a group. The depth image also contains 2.5D information, with the x and y coordinates represented implicitly. The point cloud turns the implicit x and y coordinates into an explicit data format and removes the background outside the face and the zero-value pixels inside the face that may mislead head pose estimation, thus improving performance. Fig. 1 shows some head point clouds with their RGB and depth images.
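
To make this conversion concrete, the following minimal sketch back-projects a depth image into a 3D point cloud with the pinhole camera model; the intrinsics fx, fy, cx, cy are placeholders for the depth camera calibration, and the valid-depth masking mirrors the removal of zero-value pixels described above.

```python
# Minimal sketch: back-projecting a depth image into a 3D point cloud
# via the pinhole model. The intrinsics (fx, fy, cx, cy) are placeholders;
# real values come from the depth camera calibration.
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """depth: (H, W) array in meters; returns (N, 3) points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    valid = z > 0                      # drop zero-value (missing) pixels
    x = (u - cx) * z / fx              # implicit x made explicit
    y = (v - cy) * z / fy              # implicit y made explicit
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)
```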

In this paper, we propose head pose estimation from a single depth image using deep neural networks and 3D point clouds. Specifically, we first segment the input image using the head mask provided by the dataset and then convert the depth image of the head into a 3D point cloud. Before feeding the 3D points into the network, we downsample and normalize them to a normal distribution for robust estimation. Next, we apply the abstraction layer of PointNet++ to extract features from the input 3D points. Based on the extracted features, we classify the head pose angles into 36 classes at 5° intervals and predict the probability of each angle in a class with a multi-layer perceptron (MLP). Finally, we use the classification loss in the following regression layer to estimate the final head pose angles.
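
A minimal sketch of this preprocessing step is given below; the point count of 1024 and the use of simple random sampling (rather than, e.g., the farthest point sampling used inside PointNet++) are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of the preprocessing described above: sample a fixed
# number of points, then normalize to zero mean and unit variance.
import numpy as np

def preprocess(points, n_points=1024, seed=0):
    """points: (M, 3) head point cloud; returns (n_points, 3)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), n_points, replace=len(points) < n_points)
    sampled = points[idx]
    centered = sampled - sampled.mean(axis=0)      # zero mean
    return centered / (centered.std() + 1e-8)      # unit variance
```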

Compared with existing methods, the main contributions of this paper are summarized as follows:

  • We propose a network architecture for head pose estimation that takes 3D point cloud from depth as input instead of RGB image or depth image. Thus, the proposed network captures 3D spatial information better than those based on RGB image or depth image.

  • We adopt graph convolution on the 3D point cloud, which avoids stacking points to generate sufficient input dimensions, resulting in low memory cost and a lightweight model.

  • We apply a regression network after pose classification to make head pose estimation robust. Unlike previous work, which takes the expectation over class probabilities, we learn the mapping with a regression layer (see the sketch below).
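
To illustrate the third contribution, the following sketch pairs a 36-bin classifier per Euler angle with a learned regression layer over the bin probabilities, instead of the fixed soft-argmax expectation; the feature dimension and layer widths are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch (PyTorch): classify each Euler angle into 36 bins
# of 5 degrees, then learn a regression layer on the bin probabilities
# rather than taking a fixed expectation. Layer sizes are assumptions.
import torch
import torch.nn as nn

class ClassifyThenRegress(nn.Module):
    def __init__(self, feat_dim=1024, n_bins=36):
        super().__init__()
        # one 36-way classifier per Euler angle (pitch, yaw, roll)
        self.cls = nn.Linear(feat_dim, 3 * n_bins)
        # learned regression over bin probabilities, not soft-argmax
        self.reg = nn.Linear(n_bins, 1)
        self.n_bins = n_bins

    def forward(self, feat):
        logits = self.cls(feat).view(-1, 3, self.n_bins)
        probs = torch.softmax(logits, dim=-1)
        angles = self.reg(probs).squeeze(-1)   # (B, 3) Euler angles
        return logits, angles
```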

Related works

There are several approaches to head pose estimation based on RGB images, depth images or both. Two main streams among these approaches are (1) regression/discriminative methods and (2) registration (model-based) methods.

Regression/discriminative approach: It directly infers the 3D head pose angles from a 2D image. Whitehill and Movellan [4] detected facial features such as the centers of both eyes, the tip of the nose and the center of the mouth, and localized the face frame-by-frame by feeding the facial landmarks into a classifier.

Proposed method

We first convert a depth image of the head into a point cloud of size N×(d+C), where N is the number of points, d is the number of point dimensions, and C is the number of additional point features such as normals. In this work, d=3, corresponding to the x, y, z coordinates, while C=0, i.e., we do not use any other point features. Then, we feed the normalized point cloud into PointNet++ [19] to extract features and obtain a classification loss. Finally, we perform regression with the classification loss to get the final head pose angles.
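
Since the method relies on graph convolution over the point cloud, a minimal sketch of the underlying k-nearest-neighbor graph construction is given below; the neighborhood size k is an assumption, and this is only the graph-building step, not the full GCNN.

```python
# Minimal sketch: building a k-nearest-neighbor graph over the point
# cloud as input to a graph convolution, avoiding the point stacking
# used to shape data for ordinary convolutions. k is an assumption.
import torch

def knn_graph(points, k=16):
    """points: (N, 3) tensor; returns (N, k) neighbor indices."""
    dist = torch.cdist(points, points)   # (N, N) pairwise distances
    # index 0 of each row is the point itself, so skip it
    return dist.topk(k + 1, largest=False).indices[:, 1:]
```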

Experimental results

To verify the performance of the proposed method, we perform experiments on the Biwi Kinect Head Pose dataset. We use a PC with an Intel i7-7700 CPU, 16 GB RAM and an Nvidia GTX 1080Ti GPU. The network architecture is implemented in the PyTorch 1.2 framework. Moreover, we conduct an ablation study on head pose estimation without our classification module to show the effect of the classification loss on performance. We provide some head pose estimation results of the proposed method in Fig. 6.
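
Head pose accuracy on this dataset is conventionally reported as the mean absolute error of the three Euler angles; a minimal sketch of that metric follows (array shapes and names are placeholders, not the paper's evaluation code).

```python
# Minimal sketch of the conventional evaluation metric: mean absolute
# error (in degrees) of each predicted Euler angle over the test set.
import numpy as np

def mean_absolute_error(pred, gt):
    """pred, gt: (N, 3) arrays of (pitch, yaw, roll) in degrees."""
    err = np.abs(pred - gt).mean(axis=0)
    return dict(zip(("pitch", "yaw", "roll"), err))
```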

Conclusion

In this paper, we have proposed a point cloud based deep neural network for head pose estimation. We have used a 3D point cloud generated from a depth image to effectively capture 3D information. Moreover, we have utilized graph convolution, which is suitable for extracting features from unordered 3D point cloud input. First, we have built a network architecture for head pose estimation that takes a 3D point cloud as input. Second, we have performed graph convolution on the 3D point cloud to avoid stacking points, resulting in low memory cost and a lightweight model.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 61872280).

References (38)

  • B. Ahn et al., Real-time head pose estimation using multi-task deep neural network, Robotics and Autonomous Systems (2018).
  • Y. Wang et al., A deep coarse-to-fine network for head pose estimation from synthetic data, Pattern Recognition (2019).
  • K. Chan et al., A 3-D-point-cloud system for human-pose estimation, IEEE Transactions on Systems, Man, and Cybernetics: Systems (2017).
  • N. Ruiz et al., Fine-grained head pose estimation without keypoints, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2018).
  • G. Fanelli et al., Real time head pose estimation with random regression forests, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2011).
  • T. Vatahska et al., Feature-based head pose estimation from images, Proceedings of the IEEE-RAS International Conference on Humanoid Robots (2007).
  • J. Whitehill et al., A discriminative approach to frame-by-frame head pose tracking, Proceedings of the IEEE Conference on Automatic Face & Gesture Recognition (2008).
  • V. Drouard et al., Head pose estimation via probabilistic high-dimensional regression, Proceedings of the IEEE International Conference on Image Processing (2015).
  • V. Drouard et al., Switching linear inverse-regression model for tracking head pose, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (2017).
  • B. Ahn et al., Real-time head orientation from a monocular camera using deep neural network, Proceedings of the Asian Conference on Computer Vision (2014).
  • S. Lathuilière et al., Deep mixture of linear inverse regressions applied to head-pose estimation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017).
  • X. Xu et al., Joint head pose estimation and face alignment framework using global and local CNN features, Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (2017).
  • T.-Y. Yang et al., FSA-Net: learning fine-grained structure aggregation for head pose estimation from a single image, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019).
  • R. Yang et al., Model-based head pose tracking with stereovision, Proceedings of the IEEE Conference on Automatic Face & Gesture Recognition (2002).
  • C. Cao et al., 3D shape regression for real-time facial animation, ACM Transactions on Graphics (2013).
  • M. Storer et al., 3D-MAM: 3D morphable appearance model for efficient fine head pose estimation from still images, Proceedings of the IEEE Conference on Computer Vision Workshops (2009).
  • C. Luo et al., Real-time head pose estimation and face modeling from a depth image, IEEE Transactions on Multimedia (2019).
  • P. Padeleris et al., Head pose estimation on depth data based on particle swarm optimization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2012).
  • M. Venturelli et al., From depth data to head pose estimation: a Siamese approach, arXiv preprint arXiv:1703.03624 (2017).

    Yuanquan Xu received the B.S. degree in physical engineering from Zhengzhou University, China, in 2018. He is currently pursuing the M.S. degree in Xidian University, China. His research interests include computer vision, 3D reconstruction and machine learning.

    Cheolkon Jung is a Born Again Christian. He received the B.S., M.S., and Ph.D. degrees in electronic engineering from Sungkyunkwan University, South Korea, in 1995, 1997, and 2002, respectively. He was a Research Staff Member with Samsung Advanced Institute of Technology, Samsung Electronics, South Korea, from 2002 to 2007. He was also a Research Professor with the School of Information and Communication Engineering, Sungkyunkwan University, from 2007 to 2009. Since 2009, he has been with the School of Electronic Engineering, Xidian University, China, where he is currently a Full Professor and the Director of the Xidian Media Laboratory. His main research interests include image and video processing, computer vision, pattern recognition, machine learning, computational photography, video coding, virtual reality, information fusion, multimedia content analysis and management, and 3DTV.

    Yakun Chang received the B.S. degree in electronic engineering from Xidian University, China, in 2015. He is currently pursuing the Ph.D. degree in the same university. His research interests include computer vision, pattern recognition and machine learning.
