
Pattern Recognition Letters

Volume 140, December 2020, Pages 43-48

Simultaneous 3D hand detection and pose estimation using single depth images

https://doi.org/10.1016/j.patrec.2020.09.026

Highlights

  • We propose to estimate 3D hand pose while simultaneously detecting the hand location.

  • We propose to use 3D region proposals for hand detection.

  • The proposed method achieves results competitive with most existing 3D hand pose estimation methods.

Abstract

In this paper, we investigate 3D hand pose estimation from single depth images. On the one hand, accurate hand localization is a crucial factor for pose estimation. On the other hand, multi-task learning methods have achieved great success in visual recognition tasks. Therefore, we propose to simultaneously detect the hand location and estimate its 3D pose in a multi-task learning framework. We use 3D region proposals for 3D pose estimation, searching possible hand locations in 3D space. In the experiments, the proposed method is evaluated on several benchmark datasets and shown to be comparable to most existing 3D hand pose estimation methods.

Introduction

3D hand pose estimation, i.e., estimating the coordinates of the hand joints in 3D space, has grown in importance with the prevalence of depth cameras; it is the basis for advanced applications such as hand tracking and action recognition [8]. To compute the 3D hand pose, researchers usually first detect an accurate hand area in the depth image [9], [20], [26], followed by a pose estimation stage.

Hand localization is crucial to 3D hand pose estimation, especially when the hand location is unknown in a dynamic environment [14]. Precise hand localization can boost the accuracy of 3D hand pose estimation, as validated in Oberweger and Lepetit [9]. Tompson et al. [20] used a simple method for hand localization, where a Randomized Decision Forest (RDF) classifier detects the hand area according to the depth value of each pixel. Oberweger and Lepetit [9] recently proposed to use a convolutional neural network (CNN) to refine the hand area from a coarse localization result, which brings a significant improvement to 3D hand pose estimation. Chen et al. [1] used a CNN with a 2D region proposal method to detect 2D joint locations in hand images. Choi et al. [3] assumed the hand lies within a certain range of the camera in the depth image. Choi et al. [4] used a CNN to regress heatmaps of the targets' central positions for localization. Sun et al. [16] used a cascade that first detects the root hand joint and then the remaining joints. Tang et al. [19] detected hand joints hierarchically, organizing the joints in a kinematic hierarchy in which each joint relies on its ancestors.

3D hand pose estimation has been studied for several years. Given the hand location in the depth image, various methods have been proposed to estimate the 3D hand pose, i.e., the 3D coordinates of the hand joints. Most works formulate it as a global regression problem [17], [24], with the 3D coordinates of all joints as the regression target. In recent years, CNNs have been widely used for this task. Usually, single-joint refinement [10], [24] follows to improve the global regression results. Instead of coordinate regression, Tompson et al. [20] and Ge et al. [5] learned a heatmap for each joint in the image using a CNN. To capture more detail, researchers use the skeleton to characterize structural information among joints, e.g., [24], [25], [27]. In this paper, for efficiency, we use a CNN to directly regress all 3D joint coordinates.
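The direct-regression formulation can be illustrated with a minimal sketch: the network's final layer emits a flattened J×3 vector, trained against the ground-truth joints. The joint count and mean-squared-error loss below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

J = 14  # number of hand joints (e.g., the 14-joint NYU subset)

def pose_regression_loss(pred_flat, gt_joints):
    """MSE between a flattened (J*3,) prediction and (J, 3) ground truth."""
    pred = pred_flat.reshape(J, 3)
    return float(np.mean((pred - gt_joints) ** 2))

# Sanity check: a prediction identical to the ground truth has zero loss.
gt = np.random.randn(J, 3)
assert pose_regression_loss(gt.reshape(-1), gt) == 0.0
```

Flattening all joints into one target vector is what makes this a "global" regression: a single head predicts every coordinate at once, rather than one head per joint.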

Up to now, the hand detection and 3D hand pose estimation stages have been separated in most existing works. It remains an interesting question whether they can be considered simultaneously in a multi-task learning system. In object detection from RGB images, multi-task learning methods have shown their superiority, e.g., the well-known Fast R-CNN [6] and Faster R-CNN [15]. In these two methods, both object detection and region classification are implemented in one system, which performs better than considering each task in isolation. Inspired by this evidence, we formulate 3D hand pose estimation as a multi-task learning problem.

The proposed method learns three tasks simultaneously: region classification, hand region detection, and 3D hand pose estimation. Both region classification and hand region detection help boost the accuracy of 3D hand pose estimation in the multi-task learning framework. The three tasks operate on hand region proposals generated from depth images. Currently, few works on 3D hand pose estimation have investigated region proposal generation for single depth images. The most related one is [1], where the Faster R-CNN method is applied to the feature map of the depth image to generate 2D proposals for 2D pose estimation. The difference is that [1] solves a different problem, i.e., 2D rather than 3D pose estimation: it predicts only the 2D joints in the image and cannot predict 3D joints directly, because it does not consider the coordinate in the depth direction.
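The three-task objective can be sketched as a weighted sum of a per-proposal classification loss, a box-regression loss, and a pose-regression loss. The specific loss choices and weights below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def multitask_loss(cls_logits, cls_label, box_pred, box_gt,
                   pose_pred, pose_gt, lam_box=1.0, lam_pose=1.0):
    """Weighted sum of the three task losses on one region proposal."""
    # Region classification: softmax cross-entropy (hand vs. background).
    z = cls_logits - cls_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    l_cls = -log_probs[cls_label]
    # Hand region detection: smooth-L1 on the box coordinates.
    d = np.abs(box_pred - box_gt)
    l_box = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()
    # 3D pose estimation: mean-squared error on the joint coordinates.
    l_pose = np.mean((pose_pred - pose_gt) ** 2)
    return float(l_cls + lam_box * l_box + lam_pose * l_pose)
```

Training all three heads against one shared backbone is what lets the detection tasks regularize the pose head, the effect the multi-task framing is designed to exploit.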

We apply a CNN to the whole depth image and extract local features from the resulting feature map to represent candidate regions. With these local features, we detect the hand region in the image, matching the practical setting in which the hand location is always unknown.
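Extracting a fixed-size local feature per region from the shared feature map can be sketched with a simple crop-and-max-pool, a stand-in for RoI pooling; the shapes and pooling scheme here are hypothetical, not the paper's implementation.

```python
import numpy as np

def roi_max_pool(feat, box, out_size=2):
    """Crop a (C, H, W) feature map to box = (x0, y0, x1, y1) given on the
    feature-map grid, and max-pool the crop into (C, out_size, out_size)."""
    x0, y0, x1, y1 = box
    crop = feat[:, y0:y1, x0:x1]
    c, h, w = crop.shape
    out = np.zeros((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Each output bin covers roughly an equal slice of the crop.
            ys = slice(i * h // out_size,
                       max((i + 1) * h // out_size, i * h // out_size + 1))
            xs = slice(j * w // out_size,
                       max((j + 1) * w // out_size, j * w // out_size + 1))
            out[:, i, j] = crop[:, ys, xs].max(axis=(1, 2))
    return out
```

Because every proposal is pooled from the same feature map, the expensive convolutional pass runs once per image rather than once per region.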

We estimate 3D hand pose by detecting the hand location in depth images without any prior information, especially at test time. Specifically, we first use a CNN to compute a feature map for each depth image, which is then shared by all region proposals to compute feature vectors. 2D region proposals are generated by a Region Proposal Network (RPN) [15] on the image feature map. For each 2D region proposal, we compute a classification score that discriminates hand from non-hand, together with the 2D hand region coordinates in the image. For the third task, we generate 3D region proposals from the 2D proposals and use them to estimate the 3D hand joint locations. Finally, according to the classification scores, we output a single 3D pose estimate for the hand in the depth image. In the experiments, the proposed method is evaluated on several 3D hand pose estimation benchmarks and achieves state-of-the-art results when no additional hand annotation is used. The system is illustrated in Fig. 1.
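A key step in this pipeline is lifting a 2D proposal into a 3D region. One common way to do this is to back-project the box corners through the pinhole camera model using depth; the intrinsics below are placeholder values, not the paper's, and the corner-based lifting is an illustrative assumption.

```python
import numpy as np

# Placeholder pinhole intrinsics (focal lengths and principal point, pixels).
FX, FY, CX, CY = 588.0, 587.0, 320.0, 240.0

def backproject(u, v, d):
    """Map a pixel (u, v) with depth d (mm) to camera-space (x, y, z)."""
    return np.array([(u - CX) * d / FX, (v - CY) * d / FY, d])

def proposal_to_3d(box2d, depth):
    """Lift a 2D box (u0, v0, u1, v1) at a given depth to two 3D corner
    points, defining a 3D search region around the candidate hand."""
    u0, v0, u1, v1 = box2d
    return backproject(u0, v0, depth), backproject(u1, v1, depth)
```

Unlike a purely 2D proposal, the lifted region carries a depth extent, which is what allows the joints to be regressed directly in 3D camera coordinates.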

Overall, the main contributions of this paper are summarised as follows:

  • We propose to estimate 3D hand pose while simultaneously detecting the hand location, which eases the deployment of 3D hand pose estimation methods in practice.

  • We propose to use 3D region proposals for hand detection, which most existing 3D hand pose estimation works have not explored.

  • The proposed method is systematically evaluated on different benchmark datasets, showing it is comparable to most existing 3D hand pose estimation methods.

Section snippets

Related works

In this section, we review some closely related works on 3D hand pose estimation and hand detection.

Our approach

We propose to simultaneously detect the hand location and estimate its 3D pose from single depth images. It is naturally formulated into a multi-task learning problem, which is shown in Fig. 1. In this system, we consider three tasks: region classification, hand region detection, and 3D hand pose estimation. In the beginning, a depth image is fed into a deep CNN to generate an image feature map. Then, a region proposal network (RPN) is applied on the feature map to get multiple 2D region

Experiments

In this part, we provide a systematic evaluation of the proposed method on several 3D pose estimation benchmark datasets:

  • NYU hand [20]: It contains 72,757 training images and 8252 test images. Each image shows one person moving one hand, annotated with 36 3D joints. We evaluated 14 of the 36 joints, the same as in [10]. The RGBD images are acquired from 3 Kinect cameras, and we only used the frontal view. The training set contains one user and the testing set
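Benchmarks such as NYU are typically scored with the mean per-joint 3D error in millimetres; a minimal sketch of that standard metric (the array shapes are illustrative):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean Euclidean distance (e.g., in mm) over joints and frames.
    pred, gt: arrays of shape (num_frames, num_joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

Papers often also report the fraction of frames whose worst-joint error falls under a threshold, a stricter statistic than this mean.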

Conclusions and future works

In this paper, we proposed to estimate 3D hand pose while simultaneously detecting the hand location in depth images. The tasks are naturally formulated as a multi-task learning problem, evaluated on 3D hand proposals generated on the CNN feature map of the depth image. In the experiments, the proposed method is evaluated on benchmark 3D hand datasets with complex backgrounds. The results show it is comparable to existing methods using separate hand detection and pose estimation

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2018AAA0100100) and the National Natural Science Foundation of China (No. 61702095).

References (28)

  • R. Li et al., A survey on 3D hand pose estimation: cameras, methods, and datasets, Pattern Recognit. (2019)
  • A. Ramamoorthy et al., Recognition of dynamic hand gestures, Pattern Recognit. (2003)
  • Y. Zhou et al., A novel finger and hand pose estimation technique for real-time hand gesture recognition, Pattern Recognit. (2016)
  • T.-Y. Chen et al., Deep learning for integrated hand detection and pose estimation, International Conference on Pattern Recognition (2016)
  • X. Chen, G. Wang, H. Guo, C. Zhang, Pose guided structured region ensemble network for cascaded hand pose estimation, ...
  • C. Choi et al., A collaborative filtering approach to real-time hand pose estimation, Proc. IEEE Int'l Conf. on Computer Vision (2015)
  • C. Choi et al., Robust hand pose estimation during the interaction with an unknown object, Proc. IEEE Int'l Conf. on Computer Vision (2017)
  • L. Ge et al., Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs, Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition (2016)
  • R. Girshick, Fast R-CNN, Proc. IEEE Int'l Conf. on Computer Vision (2015)
  • K. He et al., Mask R-CNN, Proc. IEEE Int'l Conf. on Computer Vision (2017)
  • M. Oberweger et al., DeepPrior++: improving fast and accurate 3D hand pose estimation, Proc. IEEE Int'l Conf. on Computer Vision Workshops (2017)
  • M. Oberweger et al., Hands deep in deep learning for hand pose estimation, Proc. Computer Vision Winter Workshop (CVWW) (2015)
  • M. Oberweger et al., Training a feedback loop for hand pose estimation, Proc. IEEE Int'l Conf. on Computer Vision (2015)
Handled by Associate Editor: Sudeep Sarkar.