1 Introduction

3D hand pose estimation is an active research topic in computer vision, virtual reality and robotics [5, 18]. With the advent of depth cameras, studies based on depth images have made significant progress [28]. Nevertheless, recovering 3D hand poses remains challenging due to the poor quality of depth images, high joint flexibility, local self-similarity and severe self-occlusion.

In general, depth-based hand pose estimation methods can be divided into two main categories: generative model-based and discriminative learning-based methods. Model-based approaches assume a pre-defined hand model and fit it to the input depth image by minimizing specific objective functions [13, 21, 22, 24, 26, 31, 32]. However, the accuracy of these methods depends heavily on the objective function and is sensitive to initialization. Additionally, such tracking-based approaches struggle with large changes between adjacent frames, which are common because the hand tends to move fast. Alternatively, learning-based approaches train a model on a large amount of data and regress the hand pose parameters directly. Because the pose is detected frame by frame, these methods handle fast hand movements more easily.

Recently, learning-based approaches have achieved remarkable performance in hand pose estimation from a single depth image. Although traditional machine learning methods have made significant progress, their performance depends heavily on hand-crafted features [12, 27, 28, 30, 35]. In recent years, Deep Learning methods have received more attention due to their ability to learn effective features automatically. Early studies regressed joint locations from a depth image with a simple 2D Convolutional Neural Network [16, 20, 33, 39], which achieved high frame rates but low precision. Different strategies were proposed to improve the accuracy. One way was to improve the data quality: [19] used data augmentation to reduce the prediction error, [8, 17] converted the 2.5D depth image into a 3D voxel representation to exploit the 3D spatial structure, and [23] learned a feature mapping from high-quality synthetic images to real images. The other way was to design more complex networks to extract richer features: [9, 10, 17, 19] added residual modules to their networks, [17, 34] used encoder-decoder architectures to learn features in a latent space, and [8, 17] applied a 3D CNN instead of a 2D CNN to estimate a per-voxel likelihood of the 3D location of each hand joint. By combining these strategies, [17] achieved the best results so far in the Hands In the Million (HIM2017) Challenge Competition [36]. However, both their data preprocessing and their network structure are too complex for efficient training and testing.

To improve efficiency while maintaining accuracy, we design a highly efficient and relatively simple Convolutional Neural Network named the Hand Branch Ensemble (HBE) network. The proposed network achieves accuracy comparable to or even better than state-of-the-art methods while using less training data, shorter training time and a higher frame rate. Figure 1 gives an overview of the proposed network structure. The core idea is to exploit prior knowledge about the motion and functional importance of different fingers [2, 4, 15, 29]. Since the thumb and the index finger play the most important roles in grasping, manipulation and communication, while the middle, ring and little fingers play an auxiliary role in most cases, we simplify the five-finger structure into three parts: the thumb, the index finger and the remaining fingers. Correspondingly, each branch of the HBE network learns the features of one part. The network makes full use of shallow low-level image features, which are more sensitive to size, orientation and location, and this greatly reduces the computational complexity and the training time. Moreover, we propose a branch ensemble strategy: the features from the last fully connected layer of each branch are concatenated, and the integrated features are used to infer the joint coordinates through extra regression layers. Unlike REN [10], which trains individual fully connected layers on multiple feature regions and combines them as ensembles, our strategy directly exploits the features of different hand parts, which is more intuitive for hand pose estimation. Motivated by Deep Prior [20], we add a bottleneck layer before the output layer as a low-dimensional embedding that learns a physical hand pose prior.

Fig. 1.

The Hand Branch Ensemble (HBE) network, designed according to the activity space and functional importance of the five fingers. The top branch handles the thumb, the middle branch handles the index finger and the bottom branch handles the remaining fingers. The branch features are ensembled through an additional fully connected layer and a bottleneck layer

The proposed HBE network is evaluated on three challenging benchmarks: the HIM2017 Challenge dataset [37], the ICVL hand pose dataset [30] and the MSRA dataset [27]. The experiments show that our method achieves results comparable to or better than state-of-the-art methods.

In summary, our contributions are:

1. We propose a new three-branch Convolutional Neural Network that estimates full 3D hand joint locations from a single depth image. The structural design is inspired by the differences in functional importance among fingers. In addition, a branch feature ensemble strategy is introduced that merges the features of each branch through a fully connected layer and a low-dimensional embedding layer, which emphasizes the correlation between hand parts and enforces overall hand shape constraints.

2. We design a relatively lightweight architecture that achieves performance comparable to or better than state-of-the-art methods on publicly available datasets with less training data, shorter training time and a higher frame rate.

The paper is organized as follows. After reviewing the related work in Sect. 2, we describe our proposed method in Sect. 3. Experimental results and discussions are reported in Sect. 4, and the conclusions are drawn in Sect. 5.

2 Related Work

In this section, we briefly discuss Deep Learning based work on hand pose estimation, especially the approaches closely related to our method. These approaches have achieved good performance thanks to the success of Deep Learning and the availability of large public hand pose datasets [6, 27, 30, 33, 38]. However, most studies estimate all joints directly through a single-branch network. Deep Prior [20] was the first to introduce a bottleneck layer into the network to learn a pose prior, and Deep Model [39] adopted a forward kinematics based layer to ensure the geometric validity of the estimated poses. Despite these hand physical constraints, the accuracy of such networks remains limited.

To improve the accuracy, single-branch networks were made more complicated in order to extract richer features. [19] greatly improved the accuracy of Deep Prior by using a residual network architecture, data augmentation and better hand segmentation. [17] also used residual blocks and converted the depth image into a 3D representation, implementing an intricate 3D CNN that predicts in a voxel-to-voxel manner. Although the accuracy is significantly improved, the data conversion and network structure are so complex that training and testing are time-consuming. REN [10] also applied residual blocks in its feature extraction module and divided the feature maps of the last convolutional layer into several regions, which were integrated in the subsequent fully connected layers. However, REN used a uniform grid to extract region features without considering the spatial information of the hand feature maps.

A hierarchical branch structure can better model the hand topology. Based on REN, Pose-REN [3] boosted the accuracy by iterative refinement. Similar to our approach, they fused features of different joints and fingers according to the hand topology, but they adopted a posterior branch strategy focused on iterative refinement. In contrast, we use anterior branches to extract the features of different hand parts, so the network estimates simpler local poses and the training converges faster. Using a posterior branch structure, [16] adopts six branches representing the wrist and each finger based on the hand geometric structure. Different from their work, we consider both the functional and kinematic characteristics of the hand from a biological viewpoint, designing an anterior branch structure that first learns the specific features of each functional part and then merges them to learn global features through a bottleneck layer. In addition, we group the last three fingers into one branch rather than one branch per finger, which preserves the muscle association among them and speeds up the network convergence.

3 Methodology

In this section, we elaborate on our proposed method, network structure and implementation details. Our goal is to estimate the 3D coordinates of the J hand joints, \(C = \left\{ c_{i} \right\} _{i=1}^{J}\) with \(c_{i}= \left[ x_{i};y_{i};z_{i} \right] \), from a single depth image. We design a novel three-branch Convolutional Neural Network based on the functional importance and activity space of different fingers, and then ensemble the branch features to regress all 3D joint locations. An overview of the proposed HBE network is shown in Fig. 1.

3.1 Network Architecture

Hands are frequently used for different tasks, and each finger differs in importance and activity space [2, 15]. The thumb has a unique opposable structure and plays an important role in communication and dexterous manipulation. It is the most important finger due to its high degrees of freedom (DOF) and large activity space, so we use a separate branch to learn its features. Although each of the other four fingers has the same DOF, the index finger is closest to the thumb and the two fingers alone can form many gestures; it is therefore the second most important and is assigned to a separate branch. Considering the muscle-associated movement among the last three fingers and their high correlation in activity, we group them into a single branch.

We design the hand pose estimation network based on the finger importance discussed above. The five-finger structure of the hand is simplified into three parts, corresponding to the three branches of the network. As shown in Fig. 1, the three convolutional branches extract the features of each hand part. Since the middle, ring and little fingers are functionally less important and similar in movement, we merge them into one part and abstract the five-finger hand as a three-part structure. Each part is treated as equally important, so the feature extraction structure of each branch is identical.

The features from all branches are fused to predict the hand pose. Here we introduce the branch ensemble strategy: the features from the last fully connected layer of each branch are concatenated and used to infer the 3D joint coordinates through an extra regression layer. In addition, inspired by Deep Prior [20], we add a linear bottleneck layer before the output layer. This bottleneck embedding forces the network to learn a low-dimensional representation of the hand pose, acting as a global physical constraint on the hand shape. The label dimensions (J \(\times \) 3) of the training data are reduced by Principal Component Analysis (PCA) and used as the ground truth of the bottleneck embedding layer. The principal components and the mean obtained from PCA are used as the weights and the biases of the output layer, respectively. Finally, the output layer recovers the low-dimensional predictions of the bottleneck layer to the original J \(\times \) 3-dimensional joint positions.
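To make the construction of this PCA-based bottleneck and output layer concrete, the following sketch shows one possible way to wire it up in Keras. It is only illustrative: the layer sizes, function names and the use of the TF2 Keras API are our assumptions and not taken from the original implementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def build_ensemble_head(thumb_feat, index_feat, others_feat,
                        train_labels, pca_dim=35):
    """Branch ensemble head (illustrative sketch).

    train_labels: (N, J*3) flattened ground-truth joint coordinates.
    Returns the bottleneck embedding (supervised with the PCA-reduced
    labels) and the recovered J*3-dimensional pose.
    """
    # PCA prior fitted offline on the training labels
    mean = train_labels.mean(axis=0)                       # (J*3,)
    _, _, vt = np.linalg.svd(train_labels - mean, full_matrices=False)
    components = vt[:pca_dim]                              # (pca_dim, J*3)
    # ground truth for the bottleneck: (train_labels - mean) @ components.T

    # concatenate the last fully connected features of the three branches
    x = layers.Concatenate()([thumb_feat, index_feat, others_feat])
    x = layers.Dense(1024, activation='relu')(x)           # extra regression layer (size assumed)
    embedding = layers.Dense(pca_dim, name='bottleneck')(x)  # linear low-dimensional embedding

    # output layer with frozen weights = principal components, bias = label mean
    pose = layers.Dense(components.shape[1], trainable=False, name='pose',
                        kernel_initializer=tf.constant_initializer(components),
                        bias_initializer=tf.constant_initializer(mean))(embedding)
    return embedding, pose
```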

3.2 Branch Details

When designing the feature extraction layers, we believe that the regression problem of predicting joint positions differs considerably from the classification problem of object recognition, where semantic features are crucial. A shallow network learns low-level spatial features that are more susceptible to the size, direction and position of an object. The convolutional and max pooling layers of the feature extraction module in each branch are shown in Fig. 2. The estimation of the complex global pose is reduced to the estimation of simpler local poses, making the network more lightweight and easier to train. A larger convolution kernel captures more spatial information and a larger receptive field, which is useful for location regression and helps to infer occluded joints. In each branch we use a stack of two 5 \(\times \) 5 convolutional layers instead of a single larger one; this yields the same effective receptive field as a single 9 \(\times \) 9 convolutional layer while reducing the number of parameters, as calculated in [25]. In the feature mapping module, we add a Batch Normalization (BN) layer after each fully connected layer. The distribution of the training data shifts as it passes through the hidden layers, which hinders training; the BN layer mitigates this shift, makes the gradient flow smoother and improves the robustness and generalization of the trained model [11]. All layers use Rectified Linear Unit (ReLU) activation functions.
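As an illustration of one such branch, the sketch below stacks two 5 \(\times \) 5 convolutions per stage followed by max pooling and BN-regularized fully connected layers. The channel counts, number of stages and FC sizes are placeholders chosen by us; the exact configuration is the one given in Fig. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def feature_branch(inputs):
    """One feature-extraction branch (illustrative layer sizes).

    Two stacked 5x5 convolutions give the same effective receptive field
    as a single 9x9 convolution with fewer parameters.
    """
    x = inputs
    for channels in (16, 32):                      # number of stages/channels assumed
        x = layers.Conv2D(channels, 5, padding='same', activation='relu')(x)
        x = layers.Conv2D(channels, 5, padding='same', activation='relu')(x)
        x = layers.MaxPool2D(2)(x)                 # MP layer from Fig. 2
    x = layers.Flatten()(x)
    # feature-mapping module: fully connected + Batch Normalization + ReLU
    for units in (1024, 512):                      # FC sizes assumed
        x = layers.Dense(units)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
    return x                                       # fed to the ensemble head
```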

Fig. 2.

The structure details of the feature extraction branch. Ci denotes a convolutional layer, MP a max pooling layer and FC a fully connected layer

3.3 Loss Function

The loss function of our network is defined as:

$$\begin{aligned} Loss = L + \lambda R(w) \end{aligned}$$
(1)

where \(\lambda R(w)\) is the L2-norm regularization term and the regularization coefficient \(\lambda \) is set to 0.001 in our experiments. L is the mean square error between the predicted values and the ground truth. Specifically, we define the loss term L as:

$$\begin{aligned} L= \alpha \times L_{thumb}+\beta \times L_{index}+\gamma \times L_{others}+\sigma \times L_{d} \end{aligned}$$
(2)

where \(L_{thumb}\) is the loss of the thumb branch, \(L_{index}\) is the loss of the index finger branch, \(L_{others}\) is the loss of the other fingers branch, \(L_{d}\) is the loss of the low-dimensional embedding layer, and \(\left\{ \alpha ,\beta ,\gamma ,\sigma \right\} \) are factors that balance these losses. In our experiments we set all of them to 1 for simplicity.

Let \(c_{i}\) be the outputs of the branch predicting joint positions in 3D form and \(C_{i}\) be the ground truth; both \(c_{i}\) and \(C_{i}\) have the form \(\left[ x_{i};y_{i};z_{i} \right] \). We define the loss of each branch as:

$$\begin{aligned} L_{b}= \sum _{i=1}^{J_{b}}\left\| c_{i}- C_{i} \right\| _{2}^{2}, \quad b\in \left\{ thumb,index,others \right\} \end{aligned}$$
(3)

where \(J_{b}\) is the number of joints in each branch.

As for the bottleneck embedding, let D be the reduced dimensionality, which is much smaller than \(J \times 3\), \(p_{i}\) be the output of the bottleneck layer, and \(P_{i}\) be the dimension-reduced training label serving as the ground truth. We define the loss of the low-dimensional embedding as:

$$\begin{aligned} L_{d}= \sum _{i=1}^{D}\left\| p_{i}- P_{i} \right\| _{2}^{2} \end{aligned}$$
(4)
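For completeness, a minimal sketch of the full objective in Eqs. (1)-(4) is given below. We assume the per-frame sums are averaged over the mini-batch, and the tensor names are ours rather than from the original implementation.

```python
import tensorflow as tf

def hbe_loss(branch_preds, branch_gts, embed_pred, embed_gt, weights,
             lam=0.001, alpha=1.0, beta=1.0, gamma=1.0, sigma=1.0):
    """Total loss of Eq. (1).

    branch_preds / branch_gts: dicts with keys 'thumb', 'index', 'others',
    each holding a (batch, J_b, 3) tensor; embed_pred / embed_gt are the
    (batch, D) bottleneck outputs and PCA-reduced labels.
    """
    def branch_loss(c, C):                         # Eq. (3), averaged over the batch
        return tf.reduce_mean(tf.reduce_sum(tf.square(c - C), axis=[1, 2]))

    embed_loss = tf.reduce_mean(                   # Eq. (4)
        tf.reduce_sum(tf.square(embed_pred - embed_gt), axis=1))

    L = (alpha * branch_loss(branch_preds['thumb'],  branch_gts['thumb']) +
         beta  * branch_loss(branch_preds['index'],  branch_gts['index']) +
         gamma * branch_loss(branch_preds['others'], branch_gts['others']) +
         sigma * embed_loss)                       # Eq. (2)

    R = tf.add_n([tf.nn.l2_loss(w) for w in weights])   # L2-norm regularization R(w)
    return L + lam * R                             # Eq. (1)
```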

3.4 Implementation Details

The input to our network is a hand-only depth image, which is generated through a series of preprocessing steps. First, we crop the hand area according to the ground-truth labels provided by the dataset, then pad the cropped image into a square and resize it to 128 \(\times \) 128, while normalizing the hand depth values to [−1,1]. Pixel values that are larger than the maximum hand depth or unavailable because of noise are set to 1. This depth normalization step is important for the network to adapt to different distances between the hand and the camera.
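A sketch of these preprocessing steps is given below, using numpy and OpenCV. The bounding-box padding, the 150 mm depth range and the handling of invalid pixels are simplifications of ours; the exact procedure may differ.

```python
import numpy as np
import cv2

def preprocess(depth, joints_uvd, pad=20, depth_range=150.0, out_size=128):
    """Crop the hand, pad to a square, resize to 128x128 and normalize
    the depth to [-1, 1]; invalid or background pixels are set to 1.

    depth: raw depth image in mm; joints_uvd: (J, 3) ground-truth joints
    in image coordinates (u, v, depth), used only to locate the hand.
    """
    # bounding box around the annotated joints (padding value assumed)
    u0, v0 = (joints_uvd[:, :2].min(axis=0) - pad).astype(int)
    u1, v1 = (joints_uvd[:, :2].max(axis=0) + pad).astype(int)
    crop = depth[max(v0, 0):v1, max(u0, 0):u1].astype(np.float32)

    # pad the crop to a square, then resize (placement simplified)
    size = max(crop.shape)
    square = np.zeros((size, size), np.float32)
    square[:crop.shape[0], :crop.shape[1]] = crop
    img = cv2.resize(square, (out_size, out_size), interpolation=cv2.INTER_NEAREST)

    invalid = img <= 0                                     # missing depth readings
    img = (img - joints_uvd[:, 2].mean()) / depth_range    # normalize around hand center
    img[invalid | (img > 1.0)] = 1.0                       # background / noise -> 1
    return np.clip(img, -1.0, 1.0)
```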

Our model is trained and tested on a computer with an Intel Core i7 CPU, 32 GB of RAM and an NVIDIA GTX1080 GPU. The network is implemented in Python using the Tensorflow [1] framework. Except for the output layer, all weights are initialized from a zero-mean Normal distribution with a standard deviation of 0.01. The network is trained with back propagation using the Adam [14] optimizer with a batch size of 128 for 100 epochs. We use a dynamic learning rate with an initial value of 0.001, reduced by a factor of 0.95 every epoch, and a dropout keep probability of 0.85.
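The optimizer and learning-rate schedule described above can be expressed as in the following sketch, written against the TF2 Keras API; the original implementation predates this API, so this is only an equivalent formulation.

```python
import tensorflow as tf

def make_optimizer(steps_per_epoch):
    """Adam with an initial learning rate of 0.001, decayed by a factor
    of 0.95 at the end of every epoch."""
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.001,
        decay_steps=steps_per_epoch,   # one decay step per epoch
        decay_rate=0.95,
        staircase=True)
    return tf.keras.optimizers.Adam(learning_rate=schedule)
```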

4 Experiments

In this section we evaluate our Hand Branch Ensemble (HBE) network on several challenging public hand pose datasets. First, we introduce these datasets and the parameters of our method. Then we describe the evaluation metrics, and finally we present and discuss our quantitative and qualitative results.

4.1 Datasets

We evaluate our network on three public hand pose datasets: the recent high-quality HIM2017 Challenge dataset [37], and the widely used ICVL [30] and MSRA [27] datasets.

ICVL Dataset [30] includes a training set of 330K depth frames (including additional in-plane rotation augmentations) and 1.5K testing depth images. In our experiments, we use only 110K randomly sampled training frames. The dataset provides 16 annotated 3D joints.

MSRA Dataset [27] contains 76K depth frames from 9 subjects with 21 annotated joints. Following [27], we use the leave-one-subject-out cross-validation strategy and average the results over the 9 subjects.

Hands In the Million (HIM2017) Challenge Dataset contains a frame-based hand pose estimation dataset and a continuous action tracking dataset [36]. We focus on the frame-based estimation dataset, which samples poses from the BigHand2.2M [38] and FHAD [6] datasets and consists of 957K training and 295K testing depth images. The training data is randomly shuffled rather than organized in continuous action sequences. Containing both first-person and third-person view depth images, this dataset is particularly challenging due to its abundant viewpoints and hand poses. Moreover, it provides accurate 21-joint 3D location annotations.

In our experiments, we randomly sample 72K frames from the original HIM2017 Challenge training set as our training set. Since the original test set provided by the Challenge does not contain ground truth, we cannot measure the accuracy of our method on it ourselves. For a fairer evaluation, and considering that the original test set contains a total of 295,510 frames of SEEN and UNSEEN subjects, we randomly sample 295,510 frames from the original training set (not included in our training set) to form a new test set. Since our test set contains only SEEN subjects, we compare only against the SEEN results in the Challenge leaderboard.

4.2 Evaluation Metrics

We follow the common evaluation metrics for hand pose estimation (a minimal sketch of both metrics is given after the list):

1. Mean joint error: the mean 3D distance error over all joints in each frame, averaged across all testing frames.

2. Correct frame proportion: the proportion of frames in which all joints are within a certain distance of the ground-truth annotation.
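Both metrics can be computed directly from the predicted and ground-truth joint arrays, as in the following sketch (our own, with assumed array shapes):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean 3D distance over all joints and all test frames.
    pred, gt: (num_frames, J, 3) arrays in the same unit (e.g. mm)."""
    return np.linalg.norm(pred - gt, axis=2).mean()

def correct_frame_proportion(pred, gt, thresholds):
    """Fraction of frames whose worst joint error is below each threshold."""
    worst = np.linalg.norm(pred - gt, axis=2).max(axis=1)     # (num_frames,)
    return np.array([(worst <= t).mean() for t in thresholds])
```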

4.3 Self-comparisons

First, we compare the effect of the number of branches on the results, as shown in the left part of Fig. 3. Single-branch means that we do not decompose the hand into parts but predict all joints directly through a single-branch CNN. For the Two-branch variant, one branch handles the thumb and the other handles the remaining fingers. The Three-branch variant is the original network. In the Four-branch variant, the last branch handles the ring and little fingers together, while the other branches each handle one finger. In the Five-branch variant, each branch corresponds to one finger. By adjusting the number of convolution channels, the number of parameters of each network is kept roughly constant. These networks are trained and tested on the HIM2017 Challenge dataset.

As shown in the left part of Fig. 3, the original three-branch structure achieves the best accuracy. The horizontal axis of Fig. 3 lists the joints: C denotes the wrist, and \(Ti\ (i \in \left\{ 1,2,3,4\right\} )\), Ii, Mi, Ri and Li denote the joints of the thumb, index, middle, ring and little finger, respectively; Avg denotes the mean joint error. For each finger, taking the thumb as an example, T1, T2, T3 and T4 denote the MCP joint, PIP joint, DIP joint and the fingertip, respectively. The following figures use the same notation.

There is a linkage between the middle finger and the ring finger, which is forcibly broken by the Four-branch and Five-branch structures. Furthermore, in most cases the last three fingers move within the same activity range, and the three-branch network can extract their associated features and reduce redundancy when combining and mapping features. Therefore, the Three-branch variant outperforms the others.

Fig. 3.

Self-comparisons. Left: Distribution of joint errors in different branch-structures. Right: Distribution of joint errors in different bottleneck dimensions

The effect of a bottleneck layer with a low-dimensional embedding was demonstrated in Deep Prior [20]. In our experiments, we also use this technique to introduce a physical prior on the overall hand pose. For the ICVL dataset, we follow [20] and use a 30-dimensional embedding bottleneck layer. On the MSRA and HIM2017 datasets, we use a 35-dimensional embedding layer according to the experimental results shown on the right of Fig. 3, which are evaluated on the MSRA P0 test set. The distribution of joint errors shows that 35 dimensions out of the original 63-dimensional pose space performs best. The evaluation shows that enforcing a pose prior is beneficial compared to direct regression in the full pose space, in line with the conclusion of [20], although in our experiments the improvement in accuracy is not significant.

We then evaluate the importance of our ensemble strategy on the HIM2017 dataset. When we directly concatenate the joint predictions of the three branches instead of fusing their features as in our ensemble strategy, the mean joint error is 5.71 mm, whereas our original network with feature ensemble achieves 5.26 mm. The distribution of joint errors and the correct frame proportion are shown in Fig. 4, which shows that the ensemble strategy in the fully connected layer achieves the best performance and confirms the effectiveness of the ensemble method used in our network.

Fig. 4.

Self-comparisons of the ensemble strategy. Left: Distribution of joint errors. Right: Correct frame proportion

Qualitative Results: We present qualitative results on the ICVL, MSRA and HIM2017 datasets in Fig. 5. As can be seen, most hand poses are predicted correctly on all three datasets.

Fig. 5.

The qualitative results on the MSRA, ICVL and HIM2017 Challenge dataset. The ground truth is marked in blue lines and the prediction is marked in red lines. (Color figure online)

4.4 Comparison with State-of-the-Art Methods

We compare the performance of the Hand Branch Ensemble (HBE) network on three public 3D hand pose datasets (HIM2017, ICVL and MSRA) with state-of-the-art methods, including Deep Prior [20], Deep Model [39], latent random forest (LRF) [30], Crossing Nets [34], V2V-PoseNet [17], Cascade [27], MultiView [7], Pose-REN [3] and Global2Local [16]. Some reported results of previous works [17, 20, 30, 39] are computed from their predictions available online. Other results [3, 7, 16, 27, 34] are taken from the figures and tables of their papers.

Table 1. The mean Joint Error on the ICVL Dataset

Our network is evaluated on the ICVL dataset and compared with the state-of-the-art methods. As shown in Table 1, we obtain better results than Cascade but are inferior to V2V-PoseNet; however, we use less training data and our model has far fewer parameters. Figure 6 shows the correct frame proportion on the ICVL dataset compared with Deep Prior [20], Deep Model [39], latent random forest (LRF) [30], Crossing Nets [34] and Cascade [27], where the horizontal axis represents the maximum allowed distance to the ground truth. Overall, we achieve performance comparable to state-of-the-art methods on the ICVL dataset under the standard evaluation metrics.

Fig. 6.

Correct frame proportion on the ICVL dataset

On the MSRA dataset, we compare with Cascade [27], MultiView [7], Crossing Nets [34] and Global2Local [16], as shown in the left part of Fig. 7. Global2Local [16] also uses a branch-like structure, but our method differs considerably, as described in Sect. 2. The results also show that our anterior three-branch structure achieves better performance.

Fig. 7.

Comparison with state-of-the-art methods. Left: Correct frame proportion on the MSRA dataset. Right: Correct frame proportion of SEEN subjects on the HIM2017 dataset. The curves of THU VCLab and NAIST RVLab are from [36]

We also run our HBE network and the Deep Prior network on the HIM2017 Challenge dataset and obtain predictions for all joints on our test set. Since our test set has the same size as the original test set but contains only SEEN subjects, we compare only against the SEEN results in the Challenge leaderboard. Table 2 shows the leaderboard and our comparison of mean joint error in millimeters. The right part of Fig. 7 shows the proportion of SEEN-subject frames whose mean joint error is within a given value. The results of Pose-REN from THU VCLab and of NAIST RVLab are taken from [36]. We emphasize that this comparison is only approximate; nevertheless, the results indicate that our method performs very well.

Table 2. The approximate comparison on the HIM2017 Challenge Dataset
Table 3. The comparison of computational complexity on the HIM2017 Challenge dataset

4.5 Computational Complexity

We take the HIM2017 Challenge dataset as an example to compare the computational complexity of the proposed HBE network and V2V-PoseNet. We train our network on a single GPU for 100 epochs, which takes 26250.24 s (7.2 h). Input generation and data preprocessing take 435 s, and loading the input data takes 7.04 s. In the testing stage, processing a frame takes 1.5 ms.

Table 3 compares the computational complexity of our HBE network with that of V2V-PoseNet. We use only part of the original training set, while V2V-PoseNet uses the entire training set and spends 6 days on training, including time-consuming I/O operations. In the testing stage, we achieve 673 fps on a single GPU, whereas V2V-PoseNet reaches 3.5 fps on a single GPU and 35 fps in a multi-GPU environment. Unlike V2V-PoseNet, we need neither voxel data conversion nor an ensemble of epoch models for testing, and our network forwards quickly thanks to its simplicity. Besides, our method, which directly regresses 3D coordinates, has far fewer parameters than V2V-PoseNet. In summary, we use a much smaller training set and a simpler network structure, yet reach the same level of accuracy as V2V-PoseNet or even surpass it. Our method is faster, more efficient and suitable for real-time applications.

5 Conclusions

We propose a novel three-branch network called the Hand Branch Ensemble (HBE) network for 3D hand pose estimation from a single depth image. According to the activity space and functional importance of the fingers, we decompose the hand into three parts: the thumb, the index finger and the remaining fingers, with each branch corresponding to one part. The features of the three branches are ensembled to predict all 3D joint locations. Our network is trained with a small amount of training data and evaluated on three challenging datasets. Both the training and testing times are short, and the experimental results demonstrate that our method outperforms state-of-the-art methods on the HIM2017 Challenge dataset and achieves comparable performance on the ICVL and MSRA datasets. Our method has low complexity and adapts to a large range of viewpoints and varied hand poses. It provides a technical approach for tracking and analyzing complex interactions between humans and the environment.