From circle to 3-sphere: Head pose estimation by instance parameterization
Introduction
The task of inferring the orientation of a human head is known as head pose estimation. In the computer vision context, head pose estimation denotes the processing steps that transform a pixel-based digital image of a head into a high-level concept of direction [1]. Many face analysis tasks rely on accurate head pose estimation. For instance, a multi-pose face recognition system can estimate the head pose first and then select face images with similar poses for matching; a 3D face tracker can use head pose information to render the face model for optimal fitting. Other applications of head pose estimation include inferring human gaze direction in human-computer interaction (HCI) systems, monitoring driver awareness for safe driving [2] and inferring the intentions of people in both verbal and nonverbal communication environments [3].
It is often assumed that the human head is a rigid object, so three Euler angles, pitch, yaw and roll, can be used to describe the head orientation [4]. Estimating the three angles from a single 2D image is a challenging task, since there exist extensive variations among pose-unrelated factors such as identity, facial expression, illumination condition and other latent variables. Fig. 1 shows examples of these variations. In many cases, these pose-unrelated factors play a more significant role in appearance variations than pose changes do [5], [6], [7]. Therefore, extracting information in which pose changes dominate over pose-unrelated factors is a crucial point in designing a head pose estimator.
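Under the rigid-head assumption, the three Euler angles determine a single 3x3 rotation matrix. The sketch below uses one common axis convention (rotation about x for pitch, y for yaw, z for roll, composed as Rz Ry Rx); the actual convention varies across datasets and is an assumption here, not taken from the paper.

```python
import numpy as np

def head_rotation(pitch, yaw, roll):
    """Compose a 3x3 rotation matrix from Euler angles in radians.

    Assumed convention: pitch about x, yaw about y, roll about z,
    applied as R = Rz @ Ry @ Rx. Other datasets may differ.
    """
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx
```

The result is always a proper rotation (orthogonal, determinant one), which is why three angles suffice to describe head orientation under the rigidity assumption.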
Numerous approaches have been proposed over the past decades for head pose estimation. We arrange existing methods into four categories: classification-based approaches [8], [9], regression-based approaches [10], [11], [12], [7], [13], deformable-model-based approaches [14], [15], [16] and manifold-embedding-based approaches [17], [18], [5], [6]. Classification-based approaches are limited to estimating discrete 1-dimensional (yaw-only) head pose. Regression-based approaches can predict 3-dimensional continuous pose efficiently, but they are extremely sensitive to noise and pose-unrelated factors. Deformable-model-based approaches rely on the localization of facial landmarks, which limits their capability to handle extensive instance variations, especially in low-resolution images. Manifold-embedding-based approaches assume that facial images with consecutive head poses can be viewed as nearby points lying on a low-dimensional manifold embedded in the high-dimensional feature space. Head pose angles can then be recovered from the points' distribution in the manifold embedding space.
Although manifold-embedding-based approaches have achieved great success in prior research, they still suffer from several limitations. First, there is no guarantee that pose-related factors dominate over pose-unrelated factors in the manifold embedding process, since pose-unrelated factors distort the manifold building process and cause geometry deformation across instance manifolds (different combinations of identity, expression and illumination) [1]. Though various approaches [17], [5], [6] have been proposed to partially solve this problem, they either focus on a single pose-unrelated factor such as identity while ignoring the others [17], or cannot handle multiple pose-unrelated factors in a uniform way [6]. Second, previous methods try to learn the mapping from the high-dimensional feature space to the low-dimensional manifold-embedded representation. This mapping direction causes manifold degradation (highly folded or self-intersecting manifolds) [19] when the manifold topology is complicated, as in 3-dimensional pose estimation. Hence, most manifold-embedding-based estimators are limited to 1-dimensional yaw estimation and ignore pitch and roll variations. Third, the projections from the image feature space to the low-dimensional manifold are defined only on the training set [20]. The entire embedding procedure has to be repeated for new data, since these methods lack an efficient way to depict out-of-sample inputs [5].
To address the limitations of existing methods, we propose a manifold-embedding-based coarse-to-fine framework for 3-dimensional head pose estimation. The approach employs the unit circle and the 3-sphere to model a uniform manifold topology on the coarse and fine layers, respectively. By learning instance-dependent nonlinear mappings from the unit circle or 3-sphere to every instance manifold (a certain person with a certain expression under a certain illumination condition), the pose-related and -unrelated factors can be decoupled in a latent instance parametric subspace. The basic idea is that pose-unrelated factors dominate the geometry deformations across different instance manifolds. Hence, we can factorize the instance variations, which are encoded in the geometry deformations, in the instance parametric subspace.
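The 3-sphere is a natural representation for full 3-D orientation: every rotation corresponds to a unit quaternion, i.e. a point on S^3. The minimal sketch below maps a pose triple to such a point, assuming an intrinsic z-y-x composition order; the angle convention is an illustrative assumption, not one specified by the paper.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def pose_to_s3(pitch, yaw, roll):
    """Map a 3-D head pose (radians) to a unit quaternion on the 3-sphere.

    Assumed composition: roll about z, then yaw about y, then pitch about x.
    """
    hp, hy, hr = pitch / 2, yaw / 2, roll / 2
    qx = np.array([np.cos(hp), np.sin(hp), 0.0, 0.0])
    qy = np.array([np.cos(hy), 0.0, np.sin(hy), 0.0])
    qz = np.array([np.cos(hr), 0.0, 0.0, np.sin(hr)])
    return quat_mul(quat_mul(qz, qy), qx)
```

Because the product of unit quaternions is again a unit quaternion, every pose lands exactly on the 3-sphere, giving the fine layer a fixed, uniform topology regardless of instance.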
There are several merits to our approach. First, the coarse-to-fine framework enables efficient and accurate 3-dimensional continuous head pose estimation. Second, it can parameterize multiple pose-related and -unrelated factors within a uniform framework in the latent space. Third, the designed mapping direction of the manifold embedding, which is completely different from that of existing methods, effectively avoids the manifold degradation problem in 3-dimensional pose estimation. Last but not least, out-of-sample data can be effectively synthesized in the instance parametric subspace, which guarantees the generative ability of our approach.
The remainder of this paper is organized as follows. We briefly review existing head pose estimation approaches in Section 2. Section 3 elaborates the motivation and details of our approach. In Section 4, we carry out experiments on multiple databases to verify our approach and compare its performance with the state of the art. Section 5 summarizes the paper.
Background
Head pose estimation from visual perception has been a broad and diverse field for decades. To motivate our approach, we summarize existing methods and briefly review the most representative and related works.
Approach
This section describes our approach in detail. First, we discuss the motivation of the coarse-to-fine pose estimation framework. Then we propose the instance parametric subspace and the uniform geometry representation. Instance parameterization is achieved by learning instance-dependent mappings and performing pose-related/unrelated factorization in the subspace. Finally, an efficient pose inference solution is provided to estimate the head pose in a testing image. An overview of our approach is
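The mapping direction described above can be made concrete with a toy sketch: learn a nonlinear (RBF) mapping from points on the unit circle to a feature space, then recover yaw by searching the circle for the point whose mapped features best reconstruct the observation. The synthetic features, Gaussian kernel, and grid search here are all illustrative assumptions; the paper's actual features and learning procedure differ.

```python
import numpy as np

def circle_embedding(yaw):
    """Place yaw angles (radians) on the unit circle."""
    return np.stack([np.cos(yaw), np.sin(yaw)], axis=-1)

def rbf_features(points, centers, sigma=0.5):
    """Gaussian RBF features of 2-D circle points against fixed centers."""
    d = np.linalg.norm(points[:, None] - centers[None], axis=-1)
    return np.exp(-(d / sigma) ** 2)

def fit_mapping(circle_pts, features, centers):
    """Least-squares fit of the circle-to-feature-space mapping."""
    W, *_ = np.linalg.lstsq(rbf_features(circle_pts, centers), features,
                            rcond=None)
    return W

def estimate_yaw(feature, centers, W):
    """Invert the mapping by dense search over the circle."""
    grid = np.linspace(-np.pi, np.pi, 721)
    recon = rbf_features(circle_embedding(grid), centers) @ W
    return grid[np.argmin(np.linalg.norm(recon - feature, axis=1))]
```

A usage sketch with hypothetical smooth features of yaw: fit the mapping on sampled angles, then recover a held-out angle by the grid search. Because the learned function goes from the fixed circle to the feature space, no per-test-sample embedding has to be recomputed, which mirrors the out-of-sample advantage discussed in the introduction.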
Experiments
In this section, we carry out a series of experiments to demonstrate the validity of our approach and evaluate its performance. Several state-of-the-art approaches are compared with ours on both controlled and in-the-wild face databases.
Conclusion
In this paper, we present a novel head pose estimation approach. We propose the instance parametric subspace to handle multiple instance variations in a generative way. The coarse-to-fine framework, which employs a unit circle on the coarse layer and a 3-sphere on the fine layer to model the uniform geometry representation, can significantly alleviate the manifold degradation problem by learning instance-dependent nonlinear mappings in an unconventional direction. Experiments on both
References (37)
- et al., A review of motion analysis methods for human nonverbal communication computing, Image Vision Comput. (2013)
- et al., Active range of motion of the head and cervical spine: a three-dimensional investigation in healthy young adults, J. Orthop. Res. (2002)
- et al., A two-stage head pose estimation framework and evaluation, Pattern Recogn. (2008)
- et al., Homeomorphic manifold analysis (HMA): generalized separation of style and content on manifolds, Image Vision Comput. (2013)
- et al., Active shape models - their training and application, Comput. Vision Image Underst. (1995)
- et al., Composite splitting algorithms for convex optimization, Comput. Vision Image Underst. (2011)
- et al., Image distance functions for manifold learning, Image Vision Comput. (2007)
- et al., Multi-PIE, Image Vision Comput. (2010)
- et al., Head pose estimation in computer vision: a survey, IEEE Trans. Pattern Anal. Mach. Intell. (2009)
- E. Murphy-Chutorian, A. Doshi, M.M. Trivedi, Head pose estimation for driver assistance systems: a robust algorithm and...
- Recognition of human head orientation based on artificial neural networks, IEEE Trans. Neural Netw.