Abstract
Three-dimensional (3D) hand pose estimation is the task of estimating the 3D locations of hand keypoints. In recent years, this task has received much research attention due to its diverse applications in human-computer interaction and virtual reality. To the best of our knowledge, few studies have modeled self-attention in 3D hand pose estimation, despite its use in various computer vision tasks. Hence, we propose augmenting convolution with self-attention to capture long-range dependencies in a depth image. In addition, motivated by recent work that sets anchor points on a depth image, we extend anchor points to the depth dimension to regress 3D hand joint locations. We validate the proposed approaches on various hand pose datasets and obtain performance comparable to that of state-of-the-art methods. The results demonstrate the potential of these approaches in a hand-based recognition system.
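To illustrate the idea of regressing a joint from anchor points that also span the depth dimension, the following is a minimal sketch only, not the authors' implementation. It assumes an A2J-style aggregation in which every anchor predicts an offset to the joint and a response score, and the joint estimate is the softmax-weighted sum of the anchor votes. The names `aggregate_joint`, `anchors`, `offsets`, and `responses` are hypothetical, and the offsets and scores that a network would normally predict are replaced here by hand-made values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_joint(anchors, offsets, responses):
    """Estimate one 3D joint as a weighted sum of anchor votes.

    anchors   : list of (x, y, z) anchor positions (assumed 3D grid)
    offsets   : list of (dx, dy, dz) offsets, one per anchor
    responses : list of raw scores, one per anchor
    """
    weights = softmax(responses)
    joint = [0.0, 0.0, 0.0]
    for anchor, offset, w in zip(anchors, offsets, weights):
        for k in range(3):
            # Each anchor votes for the position anchor + offset.
            joint[k] += w * (anchor[k] + offset[k])
    return tuple(joint)


# Hypothetical 2 x 2 x 2 anchor grid that spans x, y, and depth z.
anchors = [(x, y, z) for x in (0.0, 4.0) for y in (0.0, 4.0) for z in (0.0, 4.0)]

# Stand-in for network output: offsets that point exactly at a known joint.
target = (1.0, 2.0, 3.0)
offsets = [tuple(t - a for t, a in zip(target, anchor)) for anchor in anchors]
responses = [0.1 * i for i in range(len(anchors))]

joint = aggregate_joint(anchors, offsets, responses)
```

Because the weights sum to one, the estimate coincides with the target whenever every anchor's vote is exact; with imperfect offsets, the response scores decide which anchors dominate the estimate.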





Acknowledgements
This study was funded by the Tote Board Enabling Lives Initiative Grant (Grant Number: GC62018NUSISS) and supported by SG Enable.
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
About this article
Cite this article
Ng, MY., Chng, CB., Koh, WK. et al. An enhanced self-attention and A2J approach for 3D hand pose estimation. Multimed Tools Appl 81, 41661–41676 (2022). https://doi.org/10.1007/s11042-021-11020-w
DOI: https://doi.org/10.1007/s11042-021-11020-w