Abstract
In virtual reality (VR) applications, haptic gloves provide feedback and more direct control than bare hands do. Most VR gloves contain flex sensors and inertial measurement units for tracking the finger joints of a single hand; however, they lack a mechanism for tracking two-hand interactions. In this paper, a vision-based method is proposed for improved two-handed glove tracking. The proposed method requires only one camera attached to a VR headset. A photorealistic glove data generation framework was established to synthesize large quantities of training data for identifying the left glove, right glove, or both gloves in images with complex backgrounds. We also incorporated the “glove pose hypothesis” in the training stage, in which spatial cues regarding relative joint positions were exploited to accurately predict glove positions under severe self-occlusion or motion blur. In our experiments, a system based on the proposed method achieved an accuracy of 94.06% on a validation set and high-speed tracking at 65 fps on a consumer graphics processing unit.
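To make the “glove pose hypothesis” concrete, it can be pictured as an auxiliary training signal on relative joint offsets: in addition to penalizing absolute joint-position error, the loss penalizes errors in the offsets between joints, so the network learns spatial relationships that remain informative when individual joints are occluded or blurred. The following is a minimal sketch under that assumption; the function name `glove_pose_loss`, the weighting factor `alpha`, and the 21-joint layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def glove_pose_loss(pred, gt, alpha=0.5):
    """Hypothetical loss sketch: pred, gt are (J, 2) arrays of 2D joint
    positions for one glove."""
    # Absolute term: mean squared error on the joint positions themselves.
    abs_err = np.mean((pred - gt) ** 2)
    # Relative term: error on all pairwise joint-to-joint offsets,
    # encoding the spatial cues between joints described in the abstract.
    pred_rel = pred[:, None, :] - pred[None, :, :]  # (J, J, 2) offsets
    gt_rel = gt[:, None, :] - gt[None, :, :]
    rel_err = np.mean((pred_rel - gt_rel) ** 2)
    return abs_err + alpha * rel_err

# Example: 21 joints with small prediction noise.
rng = np.random.default_rng(0)
gt = rng.uniform(0, 1, size=(21, 2))
pred = gt + rng.normal(scale=0.01, size=(21, 2))
print(glove_pose_loss(pred, gt))
```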
Data availability
The authors confirm that the data supporting the findings of this study are available within the article.
Acknowledgements
This study was supported by the Industrial Technology Research Institute, the National Science and Technology Council, Taiwan (Grant Numbers: NSTC 111-2222-E-A49-008 and NSTC 112-2221-E-A49-129).
Ethics declarations
Conflict of interest
The authors have no relevant financial or nonfinancial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1 (MP4 83,881 KB)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hsu, FS., Wang, TM. & Chen, LH. Robust vision-based glove pose estimation for both hands in virtual reality. Virtual Reality 27, 3133–3148 (2023). https://doi.org/10.1007/s10055-023-00860-6