Abstract
In monocular RGB-based 3D hand pose estimation, the z coordinates are more difficult to estimate than the 2D hand joint coordinates because of the intrinsic depth ambiguity. Some works therefore first estimate the 2D hand joint coordinates and then apply a 2D-to-3D lifting module to estimate the z coordinates. In this paper, we propose a new 2D-to-3D lifting module. Unlike existing methods, which estimate the z coordinates of all hand joints simultaneously, we estimate the z coordinate of each hand joint individually, taking its 2D joint features and the global image features as input. This divides the complex task into simple sub-tasks, which makes it easier to lift the 2D coordinates to 3D. In addition, our 2D-to-3D lifting module uses only convolutional operations with a shared convolutional kernel, so it has fewer network parameters than existing methods, which usually rely on fully connected layers. Furthermore, we introduce a new inter-joint attention module to learn the correlation between every pair of hand joints. We conduct experiments on two popular hand pose datasets. The experimental results show that our model achieves state-of-the-art performance compared with existing methods. An ablation study also verifies the effectiveness of each component proposed in our model.
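The joint-wise idea can be illustrated with a minimal sketch. It is not the authors' network. The 21-joint hand model, the feature sizes, the single linear layer standing in for the shared-kernel convolution, and the scaled dot-product form of the inter-joint attention are all assumptions made for illustration, and every function name is hypothetical. The sketch shows that one set of shared weights applied joint by joint needs far fewer parameters than a fully connected layer that predicts all z coordinates at once.

```python
import math
import random

# All sizes below are illustrative assumptions, not values from the paper.
NUM_JOINTS = 21   # a common 21-joint hand model
JOINT_DIM = 8     # hypothetical per-joint 2D feature size
GLOBAL_DIM = 16   # hypothetical global image feature size


def lift_joint_wise(joint_feats, global_feat, weight, bias):
    """Predict one z per joint with a single shared linear map.

    Applying the same weights to every joint is what a convolution with a
    shared kernel of size 1 over the joint axis does.
    """
    zs = []
    for feat in joint_feats:
        x = feat + global_feat  # concatenate joint and global features
        zs.append(sum(w * v for w, v in zip(weight, x)) + bias)
    return zs


def inter_joint_attention(joint_feats):
    """Assumed scaled dot-product attention between every pair of joints."""
    scale = math.sqrt(len(joint_feats[0]))
    out = []
    for q in joint_feats:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in joint_feats]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        attn = [e / total for e in exps]
        out.append([sum(a * k[d] for a, k in zip(attn, joint_feats))
                    for d in range(len(q))])
    return out


def shared_params(joint_dim, global_dim):
    """Weights plus one bias, shared by all joints."""
    return joint_dim + global_dim + 1


def fc_params(num_joints, joint_dim, global_dim):
    """One fully connected layer mapping all features to all z values."""
    in_dim = num_joints * joint_dim + global_dim
    return in_dim * num_joints + num_joints


random.seed(0)
joint_feats = [[random.uniform(-1, 1) for _ in range(JOINT_DIM)]
               for _ in range(NUM_JOINTS)]
global_feat = [random.uniform(-1, 1) for _ in range(GLOBAL_DIM)]
weight = [random.uniform(-1, 1) for _ in range(JOINT_DIM + GLOBAL_DIM)]
bias = 0.1

attended = inter_joint_attention(joint_feats)
z = lift_joint_wise(attended, global_feat, weight, bias)

print(len(z), "z values")
print("shared-kernel parameters:", shared_params(JOINT_DIM, GLOBAL_DIM))
print("fully connected parameters:", fc_params(NUM_JOINTS, JOINT_DIM, GLOBAL_DIM))
```

Because the weights are shared, reordering the joints only reorders the predicted z values, and the parameter count does not grow with the number of joints. Running attention before lifting is a choice made for this sketch; the abstract does not specify where the attention module sits.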
Data Availability
Data openly available in a public repository.
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Chen, Z., Sun, Y. Joint-wise 2D to 3D lifting for hand pose estimation from a single RGB image. Appl Intell 53, 6421–6431 (2023). https://doi.org/10.1007/s10489-022-03764-1
DOI: https://doi.org/10.1007/s10489-022-03764-1