1 Introduction

Human pose estimation, also known as human keypoint detection, has received extensive attention in recent years. Its primary purpose is to predict human joint locations from monocular RGB images. Human pose estimation is a classical mid-level computer vision task and can greatly facilitate related high-level tasks such as pedestrian detection [28] and action recognition [7].

Following the success of deep convolutional networks, current 2d human pose estimation methods perform well even in complex outdoor environments. Figure 1 shows typical 2d human pose estimation results predicted by stacked hourglass [18] on the Human3.6M dataset [11]. However, unlike 2d human pose estimation, it is challenging to obtain annotated data for the 3d human pose estimation task. Most 3d human pose datasets contain only indoor data collected in a laboratory environment, which leads to a lack of diversity. Thus, models tend to overfit when trained on such datasets. Besides, ambiguity is a widespread problem when mapping 2d to 3d, which also results in unreasonable predictions.

Fig. 1. Typical 2d human pose estimation results produced by the stacked hourglass model [18]. Images are from the Human3.6M dataset, on which the stacked hourglass model performs well.

In this paper, we propose a novel coarse-to-fine method for 3d human pose estimation. From our analysis, we find that current models usually produce large errors when predicting keypoints located at the ends of limbs, such as wrists and ankles. In contrast, joints like shoulders and hips are relatively easy to predict. Table 1 shows detailed per-joint error statistics for [14]. We assume that easy joints can help guide the prediction of hard joints; therefore, we propose a coarse-to-fine method that predicts different joints progressively. An intuitive way to deal with ambiguity in 3d human pose estimation is to leverage priors on human structure. For instance, Dabral et al. [5] use anatomically legal joint-angle constraints in their model. Here, we propose a set of limb length ratio (LLR) constraints to reduce the deviation of predicted joints from their true locations.

Our contributions can be summarized as follows:

  • We propose a coarse-to-fine method for the 3d human pose estimation task to improve the prediction accuracy of joints far from the torso. Based on a statistical analysis of predictions produced by a previous state-of-the-art method, we divide joints into three groups of different difficulty levels. Easy joints are predicted first and then used to facilitate the prediction of harder joints.

  • A set of human limb length ratio (LLR) constraints based on statistics of the physical human body structure is used to avoid unreasonable predictions, allowing the model to perform more robustly on hard joints.

  • By combining the coarse-to-fine model and LLR constraints, our method outperforms the baseline on the Human3.6M dataset. The improvement is especially significant for joints far from the torso.

Table 1. Detailed per-joint error statistics for [14]. Numbers denote the error of each joint in millimeters. Under protocol 2, the model predictions are post-processed with rigid alignment.

2 Related Work

Since our method is specifically designed for the 3d human pose estimation task, we first review recent work on it. We then review recent work on the use of human structure priors in human pose estimation.

2.1 3D Human Pose Estimation

3d human pose estimation has attracted increasing attention in recent years due to its potentially broad application prospects. The purpose of the task is to estimate accurate spatial coordinates of human keypoints from RGB images. Previous works [13, 22] have shown that human keypoint positions benefit generic action recognition tasks. At the current stage, it is almost impossible to predict 3d coordinates in the world coordinate system, as noted in [14]; thus most current methods predict coordinates in the camera coordinate system [5, 9, 25]. In this paper, our model predicts 3d human keypoint locations in the camera coordinate system as well.

Various types of methods, as well as diverse representations, have been proposed for 3d human pose estimation. A typical approach uses 3d coordinates to represent human keypoint locations and regresses them directly from a single RGB image, as proposed in [21]. However, the mapping from RGB images to 3d coordinates is so complex that it is challenging to learn the underlying relationship between images and coordinates. To overcome this problem, volumetric representations, which contain richer information than coordinates, have been used as supervision [21, 27]. Volumetric representations, however, lead to a huge number of model parameters and increased computational complexity. A compromise is to use 3d coordinates as supervision while leveraging 2d human pose predictions. With the help of powerful convolutional neural networks (CNNs), the performance of 2d human pose estimation has improved greatly in recent years. A simple yet effective method is to use 2d human pose predictions as input to regress the 3d coordinates of human keypoints [14]. Building on this work, [9] combines temporal information with 2d-to-3d pose regression, which allows the model to perform well. However, temporal information places high demands on the data, and such a model is computationally expensive, making it hard to use in practical applications.

These works make good progress, but it is worth mentioning that keypoints far from the torso fluctuate heavily in their predictions. This phenomenon is consistent with a problem in 2d pose estimation, as noted in [24]. In this paper, we propose a coarse-to-fine method that takes a 2d human pose prediction from a single image as input and predicts the 3d coordinates of human keypoints. We divide human keypoints into three groups according to difficulty: the further keypoints are from the human torso, the harder they are for a model to predict. Our model predicts easy keypoints first and then predicts medium and hard keypoints in turn, leveraging the earlier prediction results.

2.2 Human Structure Prior in Pose Estimation

In previous works, models often generate unreasonable predictions, which makes human structure priors indispensable in human pose estimation tasks. In 2d human pose estimation, [4] leverages generative adversarial networks to guide a model to learn human structure priors implicitly. [5] proposes angular constraints based on the prior that the range of motion of human joints is limited and symmetric. These constraints are reasonable, but the limb length ratio can be another useful constraint, whose distribution has been shown to obey specific rules [6]. In this paper, we propose a set of constraints based on the human limb length ratio, and experiments demonstrate that they help the model achieve better performance in 3d human pose estimation.

3 Method

In this section, we discuss the proposed method for 3d human pose estimation. We start with the coarse-to-fine method and then introduce the limb length ratio (LLR) constraint, which further improves the results.

Fig. 2. Keypoints grouped by prediction difficulty. Circles colored blue, orange, and red denote easy, medium, and hard joints respectively. The position of the hip is the midpoint of the left hip and right hip. (Color figure online)

Fig. 3. Network structure of our coarse-to-fine model. For a given RGB image, we first obtain 2d keypoint locations via a 2d human pose estimator. We then use the coarse-to-fine model to predict 3d keypoint coordinates from the 2d keypoints. Our method consists of three stages, predicting the positions of easy, medium, and hard joints in order. During the second and third stages, the model leverages predictions from the previous stage(s).

3.1 Coarse-to-Fine Model

In previous works, models usually perform worse when predicting keypoints far from the torso, such as wrists and ankles. To overcome this problem, we propose a coarse-to-fine method. We first divide keypoints into three groups according to prediction difficulty. From Table 1, we observe that the closer a keypoint is to the body torso, the more accurate the model prediction is. For instance, the model predicts the location of the head better than the elbows, and predicts the ankles worse than the knees. Thus we divide keypoints, according to their distance to the torso, into three groups: easy, medium, and hard, as shown in Fig. 2. According to Table 1, we classify the head, spine, thorax, hip, and shoulders as easy joints; the elbows and knees as medium joints; and the wrists and ankles as hard joints.
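To make the grouping concrete, a plausible encoding of the 16-joint skeleton is sketched below. The group membership follows Fig. 2, while the exact joint names and their ordering depend on the 2d estimator's skeleton convention and are assumptions here.

```python
# One plausible easy/medium/hard partition of the 16 predicted joints
# (Sect. 3.1, Fig. 2). "Hip" is the midpoint of the left and right hips.
JOINT_GROUPS = {
    "easy":   ["Hip", "RHip", "LHip", "Spine", "Thorax", "Head",
               "RShoulder", "LShoulder"],
    "medium": ["RKnee", "LKnee", "RElbow", "LElbow"],
    "hard":   ["RAnkle", "LAnkle", "RWrist", "LWrist"],
}
```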

Based on these different difficulty levels, we design a specific coarse-to-fine model, whose network structure is shown in Fig. 3. The input of our model is the 2d keypoint predictions produced by a 2d human pose estimator, and the output is the predicted 3d human keypoint coordinates. As shown in Fig. 3, our model contains three stages. In the first stage, we predict easy joints using a simple fully-connected network, which is effective for regressing 3d coordinates from 2d coordinates [14]. In the second and third stages, we predict medium and hard keypoints, taking both the 2d keypoints and the 3d predictions produced in the previous stage(s) as input. We can thereby leverage the predicted 3d joint coordinates as auxiliary information to guide the model toward more accurate predictions for challenging keypoints. To merge the 2d keypoints with the 3d predictions from previous stages, we adopt channel-wise self-attention blocks, as proposed in [10], which assign appropriate weights to the predicted 3d keypoint coordinates in the second and third stages. We compute the Euclidean distance between the 3d keypoint predictions and the ground truth as the keypoint loss \(L_K\),

$$\begin{aligned} L_K(x,y) = \frac{1}{m}\sum _{i=1}^{m}\Vert x_i - y_i \Vert , \end{aligned}$$
(1)

where x and y stand for the model prediction and the ground truth respectively, and m stands for the number of keypoints. Considering that our model produces predictions in three stages, the loss function is written as

$$\begin{aligned} L_{CTF}(x,y) = \theta _1 L_K(x_e,y_e) + \theta _2 L_K(x_m,y_m) + \theta _3 L_K(x_h,y_h) , \end{aligned}$$
(2)

where the subscripts e, m, and h denote easy, medium, and hard keypoints respectively.
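Before moving to the LLR constraint, a minimal PyTorch sketch of the three-stage model and the staged loss of Eqs. (1)-(2) follows. The layer widths, the SE-style gating used for fusion, and the default loss weights are illustrative assumptions, not values fixed by the paper.

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Channel-wise self-attention (SE-style, cf. [10]) to re-weight the
    concatenated 2d/3d features; a sketch, not the paper's exact block."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class CoarseToFine(nn.Module):
    """Predict easy -> medium -> hard 3d joints, feeding earlier 3d
    predictions back as auxiliary input (Fig. 3)."""
    def __init__(self, n_easy=8, n_med=4, n_hard=4, width=1024):
        super().__init__()
        n2d = 2 * (n_easy + n_med + n_hard)   # flattened 2d input (1 x 32)
        def mlp(in_dim, n_joints):
            return nn.Sequential(
                nn.Linear(in_dim, width), nn.ReLU(),
                nn.Linear(width, 3 * n_joints))
        self.stage1 = mlp(n2d, n_easy)
        self.fuse2 = SEFusion(n2d + 3 * n_easy)
        self.stage2 = mlp(n2d + 3 * n_easy, n_med)
        self.fuse3 = SEFusion(n2d + 3 * (n_easy + n_med))
        self.stage3 = mlp(n2d + 3 * (n_easy + n_med), n_hard)

    def forward(self, kp2d):                  # kp2d: (B, 32)
        easy = self.stage1(kp2d)
        med = self.stage2(self.fuse2(torch.cat([kp2d, easy], dim=1)))
        hard = self.stage3(self.fuse3(torch.cat([kp2d, easy, med], dim=1)))
        return easy, med, hard

def keypoint_loss(pred, gt):
    """Eq. (1): mean Euclidean distance over the predicted joints."""
    return (pred.view(-1, 3) - gt.view(-1, 3)).norm(dim=1).mean()

def ctf_loss(preds, gts, thetas=(1.0, 1.0, 1.0)):
    """Eq. (2): weighted sum of the per-stage keypoint losses."""
    return sum(t * keypoint_loss(p, g) for t, p, g in zip(thetas, preds, gts))
```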

3.2 LLR Constraint

Human pose prior knowledge is helpful in the 3d human pose estimation task, and the human limb length ratio (LLR) is an important prior, studied in [6]. To the best of our knowledge, little research has focused on the LLR prior, which helps predict accurate 3d coordinates. In this paper, we propose a set of LLR constraints based on this prior. According to the research of De Leva [6], we can assume that the distribution of adult limb length ratios follows a normal distribution. Therefore, we can compute the mean and standard deviation of each limb length ratio over the dataset.
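As a concrete illustration, these statistics can be gathered once over the training annotations. The ratio pairs below are hypothetical examples, since the paper does not enumerate the chosen pairs, and the joint-index mapping is an assumption.

```python
import numpy as np

# Hypothetical ratio pairs: each entry relates two limbs, a limb being a
# pair of joint names; the paper does not enumerate its chosen pairs.
RATIO_PAIRS = [
    (("LShoulder", "LElbow"), ("LElbow", "LWrist")),   # upper vs lower arm
    (("LHip", "LKnee"), ("LKnee", "LAnkle")),          # thigh vs shank
]

def ratio_stats(poses3d, joint_index, pairs=RATIO_PAIRS):
    """Mean and standard deviation of each limb length ratio over the
    training set. poses3d: (N, 16, 3) ground-truth 3d poses;
    joint_index: name -> index mapping for the skeleton in use."""
    stats = []
    for (a1, a2), (b1, b2) in pairs:
        num = np.linalg.norm(poses3d[:, joint_index[a1]]
                             - poses3d[:, joint_index[a2]], axis=1)
        den = np.linalg.norm(poses3d[:, joint_index[b1]]
                             - poses3d[:, joint_index[b2]], axis=1)
        r = num / den
        stats.append((r.mean(), r.std()))
    return stats   # one (R_bar, s) tuple per ratio pair
```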

Table 2. Comparison to current state-of-the-art methods on the Human3.6M validation set under protocol 1. Bold indicates the best results.
Table 3. Comparison to current state-of-the-art methods on the Human3.6M validation set under protocol 2. Bold indicates the best results.

The length of a limb can be computed as follows,

$$\begin{aligned} l(x_1,x_2) = \Vert x_1 - x_2 \Vert , \end{aligned}$$
(3)

where \(x_1\) and \(x_2\) stand for the 3d coordinates of the keypoints lying at the two ends of a limb. The limb length ratio between limb p and limb q can be computed as follows,

$$\begin{aligned} r(p,q) = \frac{l(p_{x_1},p_{x_2})}{l(q_{x_1},q_{x_2})}, \end{aligned}$$
(4)

where \(p_{x_1}\), \(p_{x_2}\), \(q_{x_1}\) and \(q_{x_2}\) stand for the 3d keypoint coordinates lying at the ends of limb p and limb q respectively. Then the LLR loss can be written as

$$\begin{aligned} L_{LLR}(X) = \frac{1}{m} \sum _{i=1}^{m} \left( 1- \frac{1}{\sqrt{2\pi }\,s}\exp \left( -\frac{\left( r(X_{i_p},X_{i_q})-\overline{R}\right) ^2 }{2s^2}\right) \right) , \end{aligned}$$
(5)

where \(X_{i_p}\) and \(X_{i_q}\) denote the two limbs of the i-th ratio pair, and \(\overline{R}\) and s denote the mean and standard deviation of the limb length ratio \(r(X_{i_p},X_{i_q})\) computed on the training set. We use the Gaussian function to penalize deviations of the ratio from its mean. The final loss function is then

$$\begin{aligned} Loss = \alpha L_{CTF} + \beta L_{LLR}, \end{aligned}$$
(6)

where \(\alpha \) and \(\beta \) are hyper-parameters and denote scale coefficients of the corresponding loss items.
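A PyTorch sketch of Eqs. (3)-(6) follows, building on the `ratio_stats` sketch above. The pair list, the statistics, and the default values of the weights are assumptions.

```python
import math
import torch

def limb_length(x1, x2):
    """Eq. (3): Euclidean length of a limb; x1, x2 are (B, 3) joint tensors."""
    return (x1 - x2).norm(dim=-1)

def llr_loss(pred, pairs, stats):
    """Eqs. (4)-(5): penalize limb length ratios that drift from the
    training-set statistics. pred: (B, 16, 3) predicted 3d joints;
    pairs: list of ((a1, a2), (b1, b2)) joint-index tuples (assumed);
    stats: list of (R_bar, s) per pair, precomputed on the training set."""
    total = 0.0
    for ((a1, a2), (b1, b2)), (r_bar, s) in zip(pairs, stats):
        r = limb_length(pred[:, a1], pred[:, a2]) \
            / limb_length(pred[:, b1], pred[:, b2])       # Eq. (4)
        gauss = torch.exp(-((r - r_bar) ** 2) / (2 * s ** 2)) \
            / (s * math.sqrt(2 * math.pi))
        total = total + (1.0 - gauss).mean()              # Eq. (5)
    return total / len(pairs)

def total_loss(l_ctf, l_llr, alpha=1.0, beta=0.1):
    """Eq. (6); the values of alpha and beta here are placeholders."""
    return alpha * l_ctf + beta * l_llr
```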

4 Experiments

In this section, we first describe the implementation details, followed by experimental results on the Human3.6M dataset [11]. In addition, intuitive comparisons between our model and benchmark methods are presented.

Table 4. Comparison of the baseline and our method w.r.t the prediction errors of medium and hard keypoints.
Fig. 4. Qualitative results of our method on the Human3.6M dataset. Each row contains two samples, and each sample contains four columns: the RGB image, the 2d human pose prediction produced by the stacked hourglass model [18], the 3d human pose prediction of our method, and the 3d ground truth, in turn. To present the 3d predictions more clearly, we rotate the figures in the third and fourth columns slightly around the y axis.

Fig. 5. Qualitative results on the MPII dataset [2]. Each row contains two samples, and each sample includes three columns: the RGB image with the corresponding 2d human pose prediction, the 3d predictions of [14], and the 3d predictions of our method, in turn.

4.1 Dataset

We conduct experiments on the Human3.6M dataset to demonstrate the performance of our method. Human3.6M is a widely used dataset in the field of 3d human pose estimation and contains comprehensive annotations. The Human3.6M data are collected in a laboratory environment, covering 11 professional actors and 17 scenarios. 3d human keypoint position annotations are obtained from a high-speed motion capture system with 4 calibrated cameras. In this paper, we choose 5 actors as the training set and 2 actors as the validation set, consistent with widely used protocols [12, 14, 27]. It is worth mentioning that we do not leverage temporal information, considering real-time performance.

4.2 Implementation Details

In our coarse-to-fine method, we use the predictions of stacked hourglass [18], a state-of-the-art 2d human pose estimation method, as input. A stacked hourglass prediction includes 16 keypoints. During data preprocessing, we reshape each 2d human pose prediction to a vector of shape \(1\times 32\) and the corresponding 3d human pose ground truth to a vector of shape \(1\times 48\). The 3d ground-truth coordinates are transformed to the camera coordinate system. To facilitate comparison with other methods, we set the keypoint Hip, the midpoint of the left hip and right hip, as the coordinate system origin, following [9, 14]. To make the model easier to converge, we normalize the 2d pose predictions and 3d pose ground truth with the mean and variance calculated on the training set. To avoid gradient explosion, we clip the maximum L2 norm of the gradient at every backpropagation step. The model is trained with a batch size of 128 for 1.22 million iterations in total; the initial learning rate is set to \(1\times 10^{-3}\) and decayed by a factor of 0.96 every 10k iterations.
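As a rough illustration, the sketch below wires the reported learning rate and decay schedule into a training step, reusing `CoarseToFine`, `ctf_loss`, `llr_loss`, and `total_loss` from the sketches in Sect. 3. The optimizer choice (Adam), the clipping threshold, and the placeholders `PAIRS` and `STATS` are assumptions.

```python
import torch

model = CoarseToFine()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam is assumed
scheduler = torch.optim.lr_scheduler.StepLR(                # x0.96 every 10k iters
    optimizer, step_size=10_000, gamma=0.96)

def train_step(kp2d, gts):                  # one batch of 128 samples
    preds = model(kp2d)                     # (easy, med, hard) predictions
    pred3d = torch.cat(preds, dim=1).view(-1, 16, 3)  # joint order is assumed
    loss = total_loss(ctf_loss(preds, gts), llr_loss(pred3d, PAIRS, STATS))
    optimizer.zero_grad()
    loss.backward()
    # Clip the gradient's L2 norm to avoid explosion; the threshold is assumed.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```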

All experiments are conducted on one Nvidia Tesla K80 GPU with 12 GB of memory.

4.3 Comparison with State-of-the-Art Methods

In Table 2, we present the results of our method and compare them with state-of-the-art methods under protocol 1. Our coarse-to-fine method clearly performs well on the Human3.6M dataset. When combined with the LLR loss, the performance of our method is further improved, decreasing the average error to 60.6 mm. Under protocol 2, rigid alignment is applied to the predictions, and our method outperforms the comparison methods on every action, as shown in Table 3. Table 4 shows that our method, which combines the LLR loss with the coarse-to-fine model, outperforms the baseline when predicting medium and hard keypoints. Figure 4 presents some examples of our predicted 3d human poses on the Human3.6M dataset.

To explore the generalization performance of our method, we conduct qualitative experiments on the MPII dataset [2] and compare our method with [14], as shown in Fig. 5. In most situations, our method produces more reasonable predictions than [14], even in wild scenes. However, it is worth mentioning that occlusion of 2d joints has a large negative impact on the 3d predictions.

5 Conclusion

In this paper, we propose a coarse-to-fine method for 3d human pose estimation, together with a set of limb length ratio constraints based on human structure. Experimental results indicate that our method is effective, particularly when predicting challenging keypoints far from the torso. Encouraged by the current results, we will investigate how to exploit context information to further improve performance.