Real-time head pose estimation using multi-task deep neural network
Introduction
Driver inattention is a major cause of traffic accidents. According to the National Highway Traffic Safety Administration (NHTSA), many of the traffic fatalities and injuries in the United States during the past two years were caused by driver inattention. About 3400 of the 35,092 US traffic deaths in 2015 were caused by driver distraction, 8.8% more than the 3197 deaths from the same cause in 2014. In other words, traffic accidents result far more often from driver error than from a defect in the vehicle or the road. Therefore, if driver inattention is detected automatically, an accident can be avoided by warning the driver in advance. Driver inattention occurs mainly when the driver is distracted or tired; in such cases, the driver adopts a different head pose than usual, so inattention can be detected before an accident occurs. Head pose estimation therefore plays an important role in active safety and advanced driver assistance systems (ADAS) in intelligent vehicles.
From the viewpoint of computer vision, head pose estimation is the process of inferring the position and orientation (yaw, pitch, and roll) of the head from input face images. Existing approaches can be roughly classified into two types: generative methods and discriminative methods. Generative methods use geometric cues or a deformable face model. These methods output continuous head pose values rather than discrete categories, and they have the advantage of also providing facial landmarks for various applications. However, since they rely heavily on the detection of facial feature points, the estimated head pose becomes less reliable in environments where facial feature points are difficult to detect, such as under large variations in head pose or facial expression, occlusion, noise, blur, and low image resolution. Discriminative methods use machine learning along with visual features of the entire face. These methods are robust to challenging head poses and low-resolution images. However, most of them divide facial images into fixed head pose intervals and classify input images into the corresponding categories, so the estimates are quantized at large intervals (usually over 10°) rather than being continuous values.
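To make the yaw/pitch/roll parameterization concrete, the sketch below (not the paper's code) recovers the three angles from a 3×3 head rotation matrix under the common Z-Y-X Euler convention; the matrix `R_yaw`, a head turned 30° in yaw, is a hypothetical example:

```python
import numpy as np

def rotation_to_euler(R):
    """Recover (yaw, pitch, roll) in degrees from a 3x3 rotation matrix,
    assuming the common Z-Y-X (yaw-pitch-roll) convention."""
    pitch = np.arcsin(-R[2, 0])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    roll = np.arctan2(R[2, 1], R[2, 2])
    return np.degrees([yaw, pitch, roll])

# Hypothetical example: a head turned 30 degrees in yaw only.
c, s = np.cos(np.radians(30)), np.sin(np.radians(30))
R_yaw = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
yaw, pitch, roll = rotation_to_euler(R_yaw)  # yaw is 30, pitch and roll are 0
```

Generative methods typically estimate such a rotation by fitting a model to detected landmarks, while discriminative methods predict the angles (or an angle category) directly from image features.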
Head pose estimation is a challenging problem in practice. Lighting changes, severe vibrations, and large pose changes occur frequently in a vehicle, and these affect the appearance of the driver's face. In addition, the head pose must be computed in real time so that the driver can be warned promptly. This paper addresses these problems using a multi-task deep learning method. Our approach uses low-resolution grayscale images for real-time computation. Qualitative and quantitative evaluations show that the proposed method outperforms existing methods.
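The setup described above, a single network over low-resolution grayscale crops with one output for detection and one for continuous pose, can be sketched as a forward pass. This is a hypothetical NumPy illustration, not the paper's architecture; the layer sizes (32×32 input, 64-d shared feature) and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 32x32 grayscale crop, 64-d shared feature.
IMG, FEAT = 32 * 32, 64
W_shared = rng.normal(0, 0.01, (FEAT, IMG))  # shared trunk weights
W_face   = rng.normal(0, 0.01, (1, FEAT))    # head 1: face/non-face logit
W_pose   = rng.normal(0, 0.01, (3, FEAT))    # head 2: yaw, pitch, roll

def forward(img):
    """One shared representation feeds both task heads (the multi-task idea)."""
    x = img.reshape(-1) / 255.0           # low-res grayscale input
    h = np.maximum(0.0, W_shared @ x)     # shared feature (ReLU)
    face_logit = (W_face @ h)[0]          # detection head (classification)
    angles = W_pose @ h                   # pose head (continuous regression)
    return face_logit, angles

face_logit, angles = forward(rng.integers(0, 256, (32, 32)))
```

The key property is that both outputs are computed from the same shared feature, so the detection and pose tasks regularize each other during training.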
Section snippets
Related work
There have been several approaches to head pose estimation using an image. This section presents the related studies according to approaches, and discusses the advantages and disadvantages of representative algorithms in each category.
Proposed method and datasets
In this section, we first give an overview of the proposed multi-view face detection and head pose estimation algorithm. The next sections discuss the details of the multi-task learning network and datasets.
Experimental results
In this section, we demonstrate our real-time head pose estimation algorithm. We report experiments that we conducted on various aspects to quantitatively and qualitatively verify the performance of the proposed algorithm. We verified the validity of the proposed multi-task learning methodology, tested the performance of the proposed algorithm in face detection and head pose estimation, and compared it with those of state-of-the-art algorithms. Finally, we present the results of our algorithm
Conclusion
In this paper, we proposed a multi-task learning-based real-time deep learning framework that can robustly estimate a driver’s head pose using images obtained under poor conditions in various vehicle environments. We also introduced a method that trains multi-task learning DNN with individual datasets, even if there are no jointly annotated datasets. Compared with the single DNN-based learning method, the proposed multi-task learning-based system showed better accuracy without overfitting to
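The training idea summarized above, one multi-task network trained from individually annotated datasets, can be sketched as an alternating batch schedule in which each step backpropagates only the loss of the head whose labels that batch carries. The dataset contents and batch counts below are hypothetical:

```python
# Hypothetical mini-batches: each dataset annotates only one task.
detection_batches = [{"task": "det"} for _ in range(3)]
pose_batches = [{"task": "pose"} for _ in range(3)]

def training_schedule(det, pose):
    """Alternate batches so every step updates the shared trunk plus
    exactly one task head: the one whose labels the batch carries."""
    for b_det, b_pose in zip(det, pose):
        yield b_det   # backpropagate detection loss only
        yield b_pose  # backpropagate pose loss only

schedule = list(training_schedule(detection_batches, pose_batches))
tasks = [b["task"] for b in schedule]
```

Under this scheme the shared layers see every dataset, while each task head is updated only on batches that carry its labels, which is what allows training without a jointly annotated dataset.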
Acknowledgments
This research was supported by the Ministry of Trade, Industry and Energy and the Korea Evaluation Institute of Industrial Technology (KEIT) under program number 10060110.
Byungtae Ahn received a B.S. degree in Electronic Engineering from Kumoh National Institute of Technology, Korea, in 2007, and an M.S. degree in Bio-mechatronics from Sungkyunkwan University, Korea, in 2011. He is currently working toward a Ph.D. degree in the Robotics Program at KAIST. He received a Qualcomm Innovation Award in 2013 and has been listed in Marquis Who's Who in the World, 2016. His research interests include deep learning, Human–Robot Interaction (HRI), and Advanced Driver Assistance Systems (ADAS). He is a student member of the IEEE.
Dong-Geol Choi received the B.S. and M.S. degrees in Electrical Engineering and Computer Science from Hanyang University in 2005 and 2007, respectively, and the Ph.D. degree in the Robotics Program from KAIST in 2016. He is currently a post-doctoral researcher at the Information & Electronics Research Institute at KAIST. His research interests include sensor fusion, autonomous robotics, and artificial intelligence. Dr. Choi received a fellowship award from the Qualcomm Korea R&D Center in 2013. He was a member of 'Team KAIST,' which won first place in the DARPA Robotics Challenge Finals 2015. He is a member of the IEEE.
Jaesik Park received his Bachelor's degree (summa cum laude) in media communication engineering from Hanyang University in 2009. He received his Master's degree and Ph.D. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 2011 and 2015, respectively. He joined Intel Labs as a research scientist in 2015. His research interests include depth map refinement and image-based 3D reconstruction. He is a member of the IEEE.
In So Kweon received the B.S. and M.S. degrees in Mechanical Design and Production Engineering from Seoul National University, Seoul, Korea, in 1981 and 1983, respectively, and the Ph.D. degree in Robotics from the Robotics Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania, in 1990. He worked for the Toshiba R&D Center, Japan, and joined the Department of Automation and Design Engineering, KAIST, Seoul, Korea, in 1992, where he is now a professor with the Department of Electrical Engineering. His research interests are sensor fusion, color modeling and analysis, visual tracking, and visual SLAM. He was the general chair of the Asian Conference on Computer Vision 2012 and is on the honorary board of the International Journal of Computer Vision (IJCV). Since 2010 he has served as a director of the Personal Plug and Play DigiCar Center, one of the National Core Research Centers. He was a member of 'Team KAIST,' which won first place in the DARPA Robotics Challenge Finals 2015. He is a member of the IEEE.
1. This work was done while J. Park was with the Robotics and Computer Vision Lab. He is currently with Intel Labs, 2200 Mission College Blvd., Santa Clara, CA 95054-1549, USA.