Abstract
Building on disentangled representation learning theory and the cross-modal variational autoencoder (VAE), we derive a "Single Input Multiple Output" (SIMO) disentangled model, \(\text{cmSIMO-}\beta\text{-VAE}\). Guided by this derived model, we design a new VAE network, named da-VAE, for the challenging task of 3D hand pose estimation from a single RGB image. The da-VAE network uses a multi-head encoder with attention modules; in cooperation with task-specific supervision, its latent space is decomposed into subspaces with explicit semantics, corresponding to the generative factors of hand pose, shape, appearance, and others. We evaluate the proposed da-VAE network on the RHD and STB datasets, where it achieves accuracy competitive with state-of-the-art methods.
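The core idea of the abstract, a latent space partitioned into named semantic subspaces (pose, shape, appearance, others), can be illustrated with a minimal stdlib-only sketch. The subspace dimensions, the `reparameterize` helper, and the `split_latent` function below are hypothetical illustrations, not taken from the paper's architecture:

```python
import math
import random

# Hypothetical latent-space layout: dimensions are illustrative only.
SUBSPACES = {"pose": 32, "shape": 16, "appearance": 16, "other": 8}

def reparameterize(mu, log_var, rng=random):
    """Standard VAE reparameterization trick: z = mu + sigma * eps,
    with eps drawn from a standard normal distribution."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def split_latent(z):
    """Split a full latent vector into the named semantic subspaces,
    mimicking the semantic decomposition described in the abstract."""
    parts, offset = {}, 0
    for name, dim in SUBSPACES.items():
        parts[name] = z[offset:offset + dim]
        offset += dim
    assert offset == len(z), "latent size must match subspace layout"
    return parts
```

In the paper's setting, each subspace would be produced by its own encoder head and trained with its own supervision signal; here a single vector is simply sliced to show the bookkeeping.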
This work was supported by the National Natural Science Foundation of China [grant numbers 61873046, U1708263].
Guo, X., Xu, S., Lin, X. et al. 3D hand pose estimation from a single RGB image through semantic decomposition of VAE latent space. Pattern Anal Applic 25, 157–167 (2022). https://doi.org/10.1007/s10044-021-01048-x