Neurocomputing

Volume 437, 21 May 2021, Pages 42-57

Spatial-aware stacked regression network for real-time 3D hand pose estimation

https://doi.org/10.1016/j.neucom.2021.01.045

Highlights

  • A stacked regression network for fast, robust and accurate 3D hand pose estimation is proposed.

  • A pose re-parameterization is adopted to utilize the 3D spatial structure of the hand.

  • A spatial attention module is adopted to reduce the influence of irrelevant features.

  • A cross-stage self-distillation module is adopted to achieve a lightweight network.

Abstract

Making full use of the spatial information in depth data is crucial for 3D hand pose estimation from a single depth image. In this paper, we propose a Spatial-aware Stacked Regression Network (SSRN) for fast, robust and accurate 3D hand pose estimation from a single depth image. By adopting a differentiable pose re-parameterization process, our method efficiently encodes the pose-dependent 3D spatial structure of the depth data as spatial-aware representations. Taking such spatial-aware representations as inputs, the stacked regression network utilizes multi-joint spatial context and the 3D spatial relationship between the estimated pose and the depth data to predict a refined hand pose. To further improve estimation accuracy, we adopt a spatial attention mechanism to reduce the influence of irrelevant features on pose regression. To improve the speed of the network, we propose a cross-stage self-distillation mechanism that distills knowledge within the network itself. Experiments on four datasets show that our proposed method achieves state-of-the-art accuracy while running at around 330 FPS on a single GPU and 35 FPS on a single CPU.

Introduction

Hand pose estimation plays an important role in human–computer interaction applications such as virtual and augmented reality. With the development of deep learning and low-cost depth cameras, the field has made significant progress in recent years [1], [2], [3]. Nevertheless, accurate and robust hand pose estimation remains challenging due to large variations in hand orientation, high similarity among fingers, severe self-occlusion and the poor quality of depth images. In addition, algorithm speed is also an important issue for satisfying the requirements of interactive application scenarios.

Recently, deep neural network-based approaches have achieved drastic performance improvements in 3D hand pose estimation. One line of work is holistic regression [4], [5], [6], [7], [8], [9], which aims to directly predict 3D pose parameters such as joint angles or 3D joint locations from the depth image. Regression-based methods are able to capture global constraints among different joints, so they are robust to self-occlusion and poor-quality images [2]. However, since these methods treat the depth image as a 2D image, they under-utilize its 3D spatial information and thus have relatively low precision.

Explicitly considering the 3D spatial properties of the depth image can significantly improve estimation accuracy [2]. One straightforward solution is to convert the depth image into 3D data, such as voxels [10], [11] or points [12], [13], [14], [15], [16], and then apply a 3D deep learning method. An alternative is to incorporate spatial-aware representations into 2D CNNs. These works use fully convolutional networks to estimate pixel-wise representations such as 3D heat-maps, 3D unit vector fields and approximated geodesic distance maps, from which the 3D hand joint locations can be inferred by post-processing [17], [18], [19] or a regression network [20]. Adopting spatial-aware representations allows 2D CNNs to consider both the 2D and 3D properties of the depth image and makes it easy to leverage a stacked architecture that reevaluates the initial estimations. However, these methods still suffer from some limitations.

First, performing pixel-wise estimation is inefficient, because it needs a computationally expensive upsampling step to obtain high-resolution pixel-wise estimations [17], [19], [21]. Compared with regression-based methods, these methods therefore require more complex network structures and more computation. Second, when the depth data near the target joints are missing or occluded, directly inferring joint coordinates from the spatial-aware representations is unreliable [13].

In essence, spatial-aware representations are embeddings of low-dimensional joint coordinates in a high-dimensional image space. They reflect the spatial locations of joints, or spatial relationships such as distance, direction and offset between the joint coordinates and each pixel of the depth image. They provide rich 3D spatial information and powerful disambiguation clues for further optimization. We argue that, for iterative refinement, the form of the representation is more critical than the process that generates it. Based on this insight, we propose a stacked regression architecture that directly encodes the previously estimated pose into spatial-aware representations via a pose re-parameterization, so that subsequent regression stages can utilize multi-joint spatial context and the 3D spatial relationship between the estimated pose and the depth data to perform iterative refinement.
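As a concrete illustration, the following is a minimal PyTorch sketch of such a pose re-parameterization (the tensor layout, function name and intrinsics handling are our assumptions for exposition, not the exact released implementation): every depth pixel is back-projected to a camera-space 3D point, and its offset to each estimated joint is stored, yielding dense spatial-aware maps from a pose with no learned parameters.

```python
import torch

def pose_reparameterization(joints_3d, depth, fx, fy, cx, cy):
    """Encode an estimated pose as dense spatial-aware offset maps.

    joints_3d: (B, J, 3) estimated joint positions in camera space.
    depth:     (B, 1, H, W) depth image in the same units (e.g. mm).
    fx, fy, cx, cy: pinhole camera intrinsics.
    Returns:   (B, J*3, H, W) per-pixel 3D offsets to every joint.
    """
    B, J, _ = joints_3d.shape
    _, _, H, W = depth.shape
    # Back-project every pixel (u, v, z) to a camera-space 3D point.
    v, u = torch.meshgrid(
        torch.arange(H, device=depth.device, dtype=depth.dtype),
        torch.arange(W, device=depth.device, dtype=depth.dtype),
        indexing="ij",
    )
    z = depth.squeeze(1)                          # (B, H, W)
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    points = torch.stack([x, y, z], dim=1)        # (B, 3, H, W)
    # Offset from each pixel's 3D point to each joint, by broadcasting.
    offsets = joints_3d.unsqueeze(-1).unsqueeze(-1) - points.unsqueeze(1)
    return offsets.reshape(B, J * 3, H, W)        # (B, J, 3, H, W) flattened
```

Because every operation here is differentiable with respect to the input joints, gradients from later refinement stages can flow back through these maps into earlier pose estimates.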

In this paper, we propose a Spatial-aware Stacked Regression Network (SSRN) for fast, robust and accurate 3D hand pose estimation from a single depth image. SSRN has multiple pose regression modules connected by differentiable pose re-parameterization modules. Specifically, a pose re-parameterization module generates spatial-aware representations directly from the previously estimated pose. The subsequent pose regression module then predicts a more accurate pose based on the multi-joint spatial context and the 3D spatial relationship between the estimated pose and the depth data encoded in these representations. We regard the first pose regression module as the initial stage and the subsequent regression modules as refinement stages. The pose re-parameterization process is simple, fast and non-parametric: it generates high-quality spatial-aware representations with little computation and storage overhead. Furthermore, we integrate and explore multiple good practices, including data augmentation, the smooth L1 loss, localization refinement and coordinate decoupling, to improve the performance of 2D CNNs for 3D hand pose estimation.
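The overall wiring can be summarized in a few lines. The sketch below is schematic, under assumed module interfaces (an initial stage mapping a depth crop to a (B, J, 3) pose, and refinement stages consuming the depth image concatenated with the re-parameterized maps from the function sketched above):

```python
import torch
import torch.nn as nn

class StackedRegression(nn.Module):
    """Schematic of the stacked design. `init_stage` and each refinement
    stage are assumed to map their input image(s) to a (B, J, 3) pose."""

    def __init__(self, init_stage, make_refine_stage, num_refinements=1):
        super().__init__()
        self.init_stage = init_stage
        self.refine_stages = nn.ModuleList(
            [make_refine_stage() for _ in range(num_refinements)]
        )

    def forward(self, depth, intrinsics):          # intrinsics: (fx, fy, cx, cy)
        poses = [self.init_stage(depth)]           # coarse initial estimate
        for stage in self.refine_stages:
            # Re-encode the latest pose as spatial-aware maps; this step is
            # differentiable, so refinement losses also reach earlier stages.
            maps = pose_reparameterization(poses[-1], depth, *intrinsics)
            poses.append(stage(torch.cat([depth, maps], dim=1)))
        return poses  # supervise every stage; use poses[-1] at test time
```

In practice each stage's output would be supervised (e.g., with the smooth L1 loss mentioned above), with only the final stage's prediction used at test time.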

Our main contributions can be summarized as follows:

  • (1)  We adopt a differentiable pose re-parameterization process to generate spatial-aware representations. Compared with performing pixel-wise estimation, it can generate high-quality spatial-aware representations with little overhead in computation and storage.

  • (2)  We incorporate the spatial-aware representations into a stacked regression network. Spatial-aware representations allow us to efficiently utilize the 3D spatial information in the depth image and multi-joint spatial context for accurate 3D hand pose estimation.

This paper is an extension of our conference paper [22]. The new contributions of this paper are summarized as follows:

  • (1)  We propose a spatial attention mechanism that replaces the global average pooling (GAP) and fully connected (FC) layers of the regression-based method, reducing the influence of irrelevant regions in the feature maps on pose estimation (a code sketch follows this list). Experimental results show that the spatial attention mechanism further improves the estimation accuracy of the regression-based method.

  • (2)  We propose a cross-stage self-distillation mechanism that aligns the features of the initial stage with those of the refinement stage (see the distillation sketch after this list). This allows us to adopt a more lightweight network in the initial stage while maintaining estimation accuracy, so the whole network runs faster at inference time.

  • (3)  We conduct more extensive self-comparison experiments and a cross-dataset experiment. Experimental results show that our method has good generalization ability.
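For the spatial attention head of contribution (1), the following is a minimal sketch of one plausible design (the layer choices and class name are our assumptions, not necessarily the paper's exact architecture): each joint predicts a softmax-normalized attention map over the feature grid, and attention-weighted feature vectors replace the GAP output before the final regression.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionHead(nn.Module):
    """A joint-wise attention map over the feature grid replaces GAP,
    so spatially irrelevant features contribute little to the regression."""

    def __init__(self, channels, num_joints):
        super().__init__()
        self.attn = nn.Conv2d(channels, num_joints, kernel_size=1)
        self.fc = nn.Linear(channels, 3)            # per-joint (x, y, z)

    def forward(self, feats):                       # feats: (B, C, H, W)
        B, C, H, W = feats.shape
        # Softmax-normalized attention map for every joint.
        attn = F.softmax(self.attn(feats).flatten(2), dim=-1)        # (B, J, HW)
        # Attention-weighted aggregation instead of global average pooling.
        pooled = torch.bmm(attn, feats.flatten(2).transpose(1, 2))   # (B, J, C)
        return self.fc(pooled)                      # (B, J, 3)
```

Unlike GAP, which averages all spatial locations equally, this head lets each joint attend to the image regions most informative for its own coordinates.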
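For the cross-stage self-distillation of contribution (2), a common formulation (assumed here; the paper may use a different distance or feature pairing) penalizes the distance between the lightweight initial stage's features and the detached refinement-stage features:

```python
import torch.nn.functional as F

def cross_stage_distillation_loss(init_feats, refine_feats):
    """Pull the lightweight initial stage's features toward the richer
    refinement-stage features. Detaching the teacher side keeps the
    refinement stage from being dragged toward the weaker initial stage.
    Both tensors are assumed to have matching shapes (a 1x1 projection
    would normally reconcile differing channel counts)."""
    return F.mse_loss(init_feats, refine_feats.detach())
```

The total training loss would then combine the per-stage pose losses with this term weighted by a small coefficient.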

We evaluate our method on four publicly available 3D hand pose datasets (NYU [1], ICVL [23], MSRA [24], HANDS 2017 [2]). Our method achieves state-of-the-art accuracy on all four datasets with fewer parameters and a faster frame rate, running at around 330 FPS on a single GPU and 35 FPS on a single CPU.

Depth-based 3D hand pose estimation

The methods for estimating 3D hand pose from a single depth image can be categorized into three classes: generative, discriminative and hybrid methods. Generative methods [25], [26], [27], [28], [29], [30], [31], [32], [33] fit a pre-defined 3D hand model to the depth input. Their effectiveness heavily relies on the construction of the hand model and the definition of the energy function. These approaches need a time-consuming optimization procedure and are likely to be trapped in local minima.

Overview

In order to better capture the spatial structure of the depth data, our method aims at incorporating spatial-aware representations into the network inference. However, we also want to avoid performing pixel-wise estimation, due to its computational overhead and lack of robustness. To that end, we adopt a pose re-parameterization process to encode an estimated pose into spatial-aware representations directly. Fig. 1 illustrates the architecture of our method. SSRN contains a feature extraction module and multiple pose regression modules connected by differentiable pose re-parameterization modules.

Dataset and evaluation metric

We conduct experiments on four publicly available datasets: NYU dataset [1], ICVL dataset [23], MSRA dataset [24] and HANDS 2017 dataset [2].

NYU dataset [1] consists of 72K training and 8.2K testing depth images captured by a PrimeSense 3D sensor. The hand pose annotation contains 36 joints. Following previous works [6], [11], we select 14 of the annotated joints for training and testing.

ICVL dataset [23] consists of 330K training and 1.6K testing depth images captured by an Intel Creative Interactive Gesture Camera.
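Two evaluation metrics are standard in this literature: the mean 3D joint distance error, and the fraction of frames whose worst joint error falls below a threshold. A minimal NumPy sketch (array shapes and function names are ours for illustration):

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Average 3D distance error over all joints and frames (mm).
    pred, gt: (N, J, 3) arrays of predicted / ground-truth joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def good_frame_rate(pred, gt, threshold_mm):
    """Fraction of frames whose worst joint error is below a threshold,
    i.e. one point of the percentage-of-good-frames curve."""
    worst = np.linalg.norm(pred - gt, axis=-1).max(axis=1)   # (N,)
    return float((worst < threshold_mm).mean())
```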

Conclusion

In this paper, we propose a Spatial-aware Stacked Regression Network (SSRN) for fast, robust and accurate 3D hand pose estimation from a single depth image. We utilize a differentiable pose re-parameterization module to efficiently generate high-quality spatial-aware representations from the previously estimated pose. Taking such representations as input, a pose regression module allows SSRN to utilize the 3D spatial structure of the depth data and multi-joint spatial context to reevaluate the initial estimation and predict a refined hand pose.

CRediT authorship contribution statement

Pengfei Ren: Conceptualization, Methodology, Software, Writing - original draft, Investigation, Validation, Data curation, Formal analysis, Writing - review & editing. Haifeng Sun: Resources, Supervision, Project administration, Funding acquisition. Weiting Huang: Investigation, Validation, Writing - review & editing. Jiachang Hao: Resources, Formal analysis, Writing - review & editing. Daixuan Cheng: Writing - review & editing. Qi Qi: Resources, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 61671079 and 61771068, and in part by the Beijing Municipal Natural Science Foundation under Grant 4182041. This work was also supported by the BUPT Excellent Ph.D. Students Foundation under Grant CX2020121.

References (80)

  • R. Li et al., A survey on 3D hand pose estimation: Cameras, methods, and datasets, Pattern Recogn. (2019).
  • J. Tompson, M. Stein, Y. Lecun, K. Perlin, Real-time continuous pose recovery of human hands using convolutional...
  • S. Yuan et al., Depth-based 3d hand pose estimation: From current achievements to future goals.
  • M. Oberweger, P. Wohlhart, V. Lepetit, Hands deep in deep learning for hand pose estimation, in: Proceedings of the...
  • H. Guo et al., Region ensemble network: Improving convolutional network for hand pose estimation.
  • X. Chen, G. Wang, H. Guo, C. Zhang, Pose guided structured region ensemble network for cascaded hand pose estimation,...
  • M. Oberweger, V. Lepetit, Deepprior++: Improving fast and accurate 3d hand pose estimation, in: Proceedings of the IEEE...
  • M. Madadi, S. Escalera, X. Baró, J. Gonzalez, End-to-end global to local cnn learning for hand pose recovery in depth...
  • M. Oberweger, P. Wohlhart, V. Lepetit, Training a Feedback Loop for Hand Pose Estimation, in: Proceedings of the IEEE...
  • L. Ge, H. Liang, J. Yuan, D. Thalmann, 3D Convolutional Neural Networks for Efficient and Robust Hand Pose Estimation...
  • G. Moon et al., V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation From a Single Depth Map.
  • L. Ge, Y. Cai, J. Weng, J. Yuan, Hand pointnet: 3d hand pose estimation using point sets, in: Proceedings of the IEEE...
  • L. Ge, Z. Ren, J. Yuan, Point-to-point regression pointnet for 3d hand pose estimation, in: Proceedings of the European...
  • S. Li et al., Point-to-pose voting based hand pose estimation using residual permutation equivariant layer.
  • Y. Chen et al., So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning.
  • X. Chen et al., Shpr-net: Deep semantic hand pose regression from point clouds, IEEE Access (2018).
  • C. Wan et al., Dense 3d regression for hand pose estimation.
  • F. Xiong et al., A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image.
  • W. Huang et al., AWR: Adaptive Weighting Regression for 3D Hand Pose Estimation.
  • X. Wu et al., Handmap: Robust hand pose estimation via intermediate dense guidance map supervision.
  • B. Xiao et al., Simple baselines for human pose estimation and tracking.
  • P. Ren, H. Sun, Q. Qi, J. Wang, W. Huang, SRN: Stacked Regression Network for Real-time 3D Hand Pose Estimation, in:...
  • D. Tang et al., Latent regression forest: Structured estimation of 3d articulated hand posture.
  • X. Sun et al., Cascaded hand pose regression.
  • I. Oikonomidis, N. Kyriazis, A.A. Argyros, Efficient model-based 3D tracking of hand articulations using Kinect, in:...
  • C. Qian et al., Realtime and robust hand tracking from depth.
  • S. Khamis et al., Learning an efficient model of hand shape variation from depth images.
  • S. Sridhar et al., Fast and robust hand tracking using detection-guided optimization.
  • A. Tagliasacchi et al., Robust articulated-ICP for real-time hand tracking, Computer Graphics Forum (2015).
  • A. Tkach, A. Tagliasacchi, E. Remelli, M. Pauly, A. Fitzgibbon, Online generative model personalization for hand...
  • L. Ballan et al., Motion capture of hands in action using discriminative salient points.
  • J. Taylor, L. Bordeaux, T. Cashman, B. Corish, C. Keskin, T. Sharp, E. Soto, D. Sweeney, J. Valentin, B. Luff, et al.,...
  • M. Ye et al., Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera.
  • J. Romero, H. Kjellström, D. Kragic, Monocular real-time 3D articulated hand pose estimation, in: IEEE-RAS...
  • G. Shakhnarovich et al., Fast pose estimation with parameter-sensitive hashing.
  • D. Tang et al., Latent regression forest: structured estimation of 3d hand poses, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
  • T. Sharp et al., Accurate, robust, and flexible real-time hand tracking.
  • S. Sridhar et al., Real-time joint tracking of a hand manipulating an object from rgb-d input.
  • C.R. Qi, L. Yi, H. Su, L.J. Guibas, Pointnet++: Deep hierarchical feature learning on point sets in a metric space,...
  • J. Wang et al., Generative Model-Based Loss to the Rescue: A Method to Overcome Annotation Errors for Depth-Based Hand Pose Estimation.
Pengfei Ren received his B.S. degree from Beijing University of Posts and Telecommunications in 2018, where he is currently working toward his M.S. degree. His research interests include deep learning, hand pose estimation and gesture recognition.

Haifeng Sun received his Ph.D. degree from Beijing University of Posts and Telecommunications in 2017, where he is now a lecturer. His research interests include data mining, information retrieval, and next generation networks.

Weiting Huang is a graduate student at Beijing University of Posts and Telecommunications. Her research interest is computer vision.

Jiachang Hao received his Bachelor's degree from Beijing University of Posts and Telecommunications in 2018, where he is now a postgraduate student. His research interests include video understanding, action recognition and object detection.

Daixuan Cheng is a bachelor student under the supervision of Prof. Jingyu Wang in the Network Intelligence Research Center at Beijing University of Posts and Telecommunications. She has worked on various projects with Prof. Haifeng Sun. She is interested in natural language processing, computer vision, machine learning and data mining.

Qi Qi received her Ph.D. degree from Beijing University of Posts and Telecommunications in 2010, where she is currently an Associate Professor with the State Key Laboratory of Networking and Switching Technology. She has published more than 30 papers in international journals and has received two National Natural Science Foundation of China grants. Her research interests include ubiquitous services, deep learning, transfer learning, deep reinforcement learning, edge computing, and the Internet of Things.

Jingyu Wang was born in 1978 and received his Ph.D. degree from Beijing University of Posts and Telecommunications in 2008. He is now a full professor with the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications, China. His research interests span broad aspects of the future internet, intelligent networks, machine learning and data mining. He has published hundreds of research papers and several books, and has been granted dozens of patents for inventions.

Jianxin Liao received his Ph.D. degree from the University of Electronic Science and Technology of China in 1996. He is currently the dean of the Network Intelligence Research Center and a full professor with the State Key Laboratory of Networking and Switching Technology at Beijing University of Posts and Telecommunications. He has published hundreds of research papers and several books, and has been granted dozens of patents for inventions. He has won a number of prizes, including the Premier's Award for Distinguished Young Scientists from the National Natural Science Foundation of China in 2005 and appointment as a Specially-invited Professor under the "Yangtse River Scholar Award Program" by the China Ministry of Education in 2009. His main creative contributions include mobile intelligent networks, service network intelligence, networking architectures and protocols, and multimedia communication. These achievements were conferred the National Prize for Progress in Science and Technology in 2004 and 2009.
