Abstract
Building on disentangled representation learning theory and the cross-modal variational autoencoder (VAE), we derive a "Single Input Multiple Output" (SIMO) disentangled model, \(\text{cmSIMO-}\beta\text{-VAE}\). Guided by this derived model, we design a new VAE network, named da-VAE, for the challenging task of 3D hand pose estimation from a single RGB image. The da-VAE network uses a multi-head encoder with attention modules; in cooperation with task-specific supervision, its latent space is decomposed into subspaces with explicit semantics, corresponding to the generative factors of hand pose, shape, appearance, and others. We evaluate the proposed da-VAE network on the RHD and STB datasets, where it achieves accuracy competitive with state-of-the-art methods.
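The core idea of the abstract, a latent space partitioned into named semantic subspaces (pose, shape, appearance, others), can be illustrated with a minimal stdlib-only sketch. The subspace dimensions, the `reparameterize` helper, and the `split_latent` function below are hypothetical illustrations, not taken from the paper's architecture:

```python
import math
import random

# Hypothetical latent-space layout: dimensions are illustrative only.
SUBSPACES = {"pose": 32, "shape": 16, "appearance": 16, "other": 8}

def reparameterize(mu, log_var, rng=random):
    """Standard VAE reparameterization trick: z = mu + sigma * eps,
    with eps drawn from a standard normal distribution."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def split_latent(z):
    """Split a full latent vector into the named semantic subspaces,
    mimicking the semantic decomposition described in the abstract."""
    parts, offset = {}, 0
    for name, dim in SUBSPACES.items():
        parts[name] = z[offset:offset + dim]
        offset += dim
    assert offset == len(z), "latent size must match subspace layout"
    return parts
```

In the paper's setting, each subspace would be produced by its own encoder head and trained with its own supervision signal; here a single vector is simply sliced to show the bookkeeping.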
This work was supported by the National Natural Science Foundation of China [grant numbers 61873046, U1708263].
Guo, X., Xu, S., Lin, X. et al. 3D hand pose estimation from a single RGB image through semantic decomposition of VAE latent space. Pattern Anal Applic 25, 157–167 (2022). https://doi.org/10.1007/s10044-021-01048-x