
Pattern Recognition Letters

Volume 140, December 2020, Pages 43-48

Simultaneous 3D hand detection and pose estimation using single depth images

https://doi.org/10.1016/j.patrec.2020.09.026

Highlights

  • We propose to estimate 3D hand pose while simultaneously detecting the hand location.

  • We propose to use 3D region proposals for hand detection.

  • The proposed method achieves results competitive with most existing 3D hand pose estimation methods.

Abstract

In this paper, we investigate 3D hand pose estimation from single depth images. On the one hand, accurate hand localization is a crucial factor for pose estimation. On the other hand, multi-task learning methods have achieved great success in visual recognition tasks. Therefore, we propose to simultaneously detect the hand location and estimate its 3D pose in a multi-task learning framework. We use 3D region proposals for 3D pose estimation, searching possible hand locations in 3D space. In the experiments, the proposed method is evaluated on several benchmark datasets and shown to be comparable to most existing 3D hand pose estimation methods.

Introduction

3D hand pose estimation, i.e., estimating the coordinates of the hand joints in 3D space, has grown in importance with the prevalence of depth cameras; it is the basis for advanced applications such as hand tracking and action recognition [8]. To compute the 3D hand pose, researchers usually first detect an accurate hand area in the depth image [9], [20], [26], followed by a pose estimation stage.

Hand localization is crucial to 3D hand pose estimation, especially when the hand location is unknown in a dynamic environment [14]. Precise hand localization can boost the accuracy of 3D hand pose estimation, as validated in Oberweger and Lepetit [9]. Tompson et al. [20] used a simple method for hand localization, where a Randomized Decision Forest (RDF) classifier detects the hand area according to the depth value of each pixel. Oberweger and Lepetit [9] recently proposed to use a convolutional neural network (CNN) to refine the hand area from a coarse localization result, which brings a significant improvement to 3D hand pose estimation. Chen et al. [1] used a CNN with a 2D region proposal method to detect 2D joint locations in hand images. Choi et al. [3] assumed the hand lies within a certain range of the camera in the depth image. Choi et al. [4] used a CNN to regress heatmaps of the targets' central positions for localization. Sun et al. [16] used a cascade that first detects the root hand joint and then the remaining joints. Tang et al. [19] detected hand joints hierarchically, organizing the joints in a kinematic hierarchy in which each joint relies on its ancestors.

3D hand pose estimation has been studied for several years. Given the hand location in the depth image, various methods have been proposed to estimate the 3D hand pose, i.e., the 3D coordinates of the hand joints. Most works formulate it as a global regression problem [17], [24], with the 3D coordinates of all joints as the regression target. In recent years, CNNs have been widely used for this task. Usually, single-joint refinement [10], [24] follows to improve the global regression results. Instead of coordinate regression, Tompson et al. [20] and Ge et al. [5] learned a heatmap for each joint in the image using a CNN. To capture more detail, researchers use the skeleton to characterize structural information among joints, e.g., [24], [25], [27]. In this paper, for efficiency, we use a CNN to directly regress all 3D joint coordinates.
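The direct-regression formulation can be illustrated with a minimal sketch: the network's final layer emits a flattened J×3 vector, trained against the ground-truth joints. The joint count and mean-squared-error loss below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

J = 14  # number of hand joints (e.g., the 14-joint NYU subset)

def pose_regression_loss(pred_flat, gt_joints):
    """MSE between a flattened (J*3,) prediction and (J, 3) ground truth."""
    pred = pred_flat.reshape(J, 3)
    return float(np.mean((pred - gt_joints) ** 2))

# Sanity check: a prediction identical to the ground truth has zero loss.
gt = np.random.randn(J, 3)
assert pose_regression_loss(gt.reshape(-1), gt) == 0.0
```

Flattening all joints into one target vector is what makes this a "global" regression: a single head predicts every coordinate at once, rather than one head per joint.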

Up to now, the hand detection and 3D hand pose estimation stages have been separated in most existing works. It remains an interesting question whether they can be considered simultaneously in a multi-task learning system. In object detection from RGB images, multi-task learning methods have shown their superiority, e.g., the well-known Fast R-CNN [6] and Faster R-CNN [15]. In these two methods, both object detection and region classification are implemented in one system, which performs better than considering each task in isolation. Inspired by this evidence, we formulate 3D hand pose estimation as a multi-task learning problem.

The proposed method learns three tasks simultaneously: region classification, hand region detection, and 3D hand pose estimation. Both region classification and hand region detection help boost the accuracy of 3D hand pose estimation in the multi-task learning framework. The three tasks operate on hand region proposals generated from depth images. Currently, few works on 3D hand pose estimation have investigated region proposal generation for single depth images. The most related one is [1], where the Faster R-CNN method is applied to the feature map of the depth image to generate 2D proposals for 2D pose estimation. The difference is that [1] solves a different problem, i.e., 2D rather than 3D pose estimation: it predicts only the 2D joints in the image and cannot predict 3D joints directly, because it does not consider the coordinate in the depth direction.
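The three-task objective can be sketched as a weighted sum of a per-proposal classification loss, a box-regression loss, and a pose-regression loss. The specific loss choices and weights below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def multitask_loss(cls_logits, cls_label, box_pred, box_gt,
                   pose_pred, pose_gt, lam_box=1.0, lam_pose=1.0):
    """Weighted sum of the three task losses on one region proposal."""
    # Region classification: softmax cross-entropy (hand vs. background).
    z = cls_logits - cls_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    l_cls = -log_probs[cls_label]
    # Hand region detection: smooth-L1 on the box coordinates.
    d = np.abs(box_pred - box_gt)
    l_box = np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()
    # 3D pose estimation: mean-squared error on the joint coordinates.
    l_pose = np.mean((pose_pred - pose_gt) ** 2)
    return float(l_cls + lam_box * l_box + lam_pose * l_pose)
```

Training all three heads against one shared backbone is what lets the detection tasks regularize the pose head, the effect the multi-task framing is designed to exploit.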

We apply a CNN to the whole depth image and extract local features from the resulting feature map to represent candidate regions. With these local features, we detect the hand region in the image, matching the practical setting in which the hand location is always unknown.
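Extracting a fixed-size local feature per region from the shared feature map can be sketched with a simple crop-and-max-pool, a stand-in for RoI pooling; the shapes and pooling scheme here are hypothetical, not the paper's implementation.

```python
import numpy as np

def roi_max_pool(feat, box, out_size=2):
    """Crop a (C, H, W) feature map to box = (x0, y0, x1, y1) given on the
    feature-map grid, and max-pool the crop into (C, out_size, out_size)."""
    x0, y0, x1, y1 = box
    crop = feat[:, y0:y1, x0:x1]
    c, h, w = crop.shape
    out = np.zeros((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Each output bin covers roughly an equal slice of the crop.
            ys = slice(i * h // out_size,
                       max((i + 1) * h // out_size, i * h // out_size + 1))
            xs = slice(j * w // out_size,
                       max((j + 1) * w // out_size, j * w // out_size + 1))
            out[:, i, j] = crop[:, ys, xs].max(axis=(1, 2))
    return out
```

Because every proposal is pooled from the same feature map, the expensive convolutional pass runs once per image rather than once per region.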

We estimate 3D hand pose by detecting the hand location in depth images without any prior information, especially at test time. Specifically, we first use a CNN to compute a feature map for each depth image, which is then shared by all region proposals to compute feature vectors. 2D region proposals are generated by a Region Proposal Network (RPN) [15] on the image feature map. For each 2D region proposal, we compute a classification score that discriminates hand from non-hand, together with the 2D hand region coordinates in the image. For the third task, we generate 3D region proposals from the 2D proposals and use them to estimate the 3D hand joint locations. Finally, according to the classification scores, we output a single 3D pose estimate for the hand in the depth image. In the experiments, the proposed method is evaluated on several 3D hand pose estimation benchmarks and achieves state-of-the-art results when no additional hand annotation is used. The system is illustrated in Fig. 1.
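A key step in this pipeline is lifting a 2D proposal into a 3D region. One common way to do this is to back-project the box corners through the pinhole camera model using depth; the intrinsics below are placeholder values, not the paper's, and the corner-based lifting is an illustrative assumption.

```python
import numpy as np

# Placeholder pinhole intrinsics (focal lengths and principal point, pixels).
FX, FY, CX, CY = 588.0, 587.0, 320.0, 240.0

def backproject(u, v, d):
    """Map a pixel (u, v) with depth d (mm) to camera-space (x, y, z)."""
    return np.array([(u - CX) * d / FX, (v - CY) * d / FY, d])

def proposal_to_3d(box2d, depth):
    """Lift a 2D box (u0, v0, u1, v1) at a given depth to two 3D corner
    points, defining a 3D search region around the candidate hand."""
    u0, v0, u1, v1 = box2d
    return backproject(u0, v0, depth), backproject(u1, v1, depth)
```

Unlike a purely 2D proposal, the lifted region carries a depth extent, which is what allows the joints to be regressed directly in 3D camera coordinates.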

Overall, the main contributions of this paper are summarised as follows:

  • We propose to estimate 3D hand pose while simultaneously detecting the hand location, which eases the deployment of 3D hand pose estimation methods in practice.

  • We propose to use 3D region proposals for hand detection, which most existing 3D hand pose estimation works have not explored.

  • The proposed method is systematically evaluated on different benchmark datasets, showing it is comparable to most existing 3D hand pose estimation methods.

Section snippets

Related works

In this section, we review some closely related works on 3D hand pose estimation and hand detection.

Our approach

We propose to simultaneously detect the hand location and estimate its 3D pose from single depth images. It is naturally formulated into a multi-task learning problem, which is shown in Fig. 1. In this system, we consider three tasks: region classification, hand region detection, and 3D hand pose estimation. In the beginning, a depth image is fed into a deep CNN to generate an image feature map. Then, a region proposal network (RPN) is applied on the feature map to get multiple 2D region

Experiments

In this part, we provide a systematic evaluation of the proposed method on several 3D pose estimation benchmark datasets:

  • NYU hand [20]: It contains 72,757 training images and 8252 test images. Each image shows one person moving one hand, annotated with 36 3D joints. We evaluated 14 of the 36 joints, the same as in [10]. The RGBD images are acquired from 3 Kinect cameras, and we only used the frontal view. The training set contains one user and the testing set
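Benchmarks such as NYU are typically scored with the mean per-joint 3D error in millimetres; a minimal sketch of that standard metric (the array shapes are illustrative):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Mean Euclidean distance (e.g., in mm) over joints and frames.
    pred, gt: arrays of shape (num_frames, num_joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

Papers often also report the fraction of frames whose worst-joint error falls under a threshold, a stricter statistic than this mean.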

Conclusions and future works

In this paper, we proposed to estimate 3D hand pose while simultaneously detecting the hand location in depth images. The tasks are naturally formulated as a multi-task learning problem, evaluated on 3D hand proposals generated on the CNN feature map of the depth image. In the experiments, the proposed method is evaluated on benchmark 3D hand datasets with complex backgrounds. The results show it is comparable to existing methods using separate hand detection and pose estimation

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Key R&D Program of China (No. 2018AAA0100100) and the National Natural Science Foundation of China (No. 61702095).

References (28)

  • R. Li et al., A survey on 3D hand pose estimation: cameras, methods, and datasets, Pattern Recognit. (2019)
  • A. Ramamoorthy et al., Recognition of dynamic hand gestures, Pattern Recognit. (2003)
  • Y. Zhou et al., A novel finger and hand pose estimation technique for real-time hand gesture recognition, Pattern Recognit. (2016)
  • T.-Y. Chen et al., Deep learning for integrated hand detection and pose estimation, International Conference on Pattern Recognition (2016)
  • X. Chen, G. Wang, H. Guo, C. Zhang, Pose guided structured region ensemble network for cascaded hand pose estimation, ...
  • C. Choi et al., A collaborative filtering approach to real-time hand pose estimation, Proc. IEEE Int'l Conf. on Computer Vision (2015)
  • C. Choi et al., Robust hand pose estimation during the interaction with an unknown object, Proc. IEEE Int'l Conf. on Computer Vision (2017)
  • L. Ge et al., Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs, Proc. IEEE Int'l Conf. on Computer Vision and Pattern Recognition (2016)
  • R. Girshick, Fast R-CNN, Proc. IEEE Int'l Conf. on Computer Vision (2015)
  • K. He et al., Mask R-CNN, Proc. IEEE Int'l Conf. on Computer Vision (2017)
  • M. Oberweger et al., DeepPrior++: improving fast and accurate 3D hand pose estimation, Proc. IEEE Int'l Conf. on Computer Vision Workshops (2017)
  • M. Oberweger et al., Hands deep in deep learning for hand pose estimation, Proc. Computer Vision Winter Workshop (CVWW) (2015)
  • M. Oberweger et al., Training a feedback loop for hand pose estimation, Proc. IEEE Int'l Conf. on Computer Vision (2015)
Handled by Associate Editor: Sudeep Sarkar.