Abstract
Recovering 3D human meshes from monocular images is an inherently ambiguous and challenging task due to depth ambiguity, joint occlusion, and truncation. Nevertheless, most recent works avoid modeling this uncertainty and typically produce a single reconstruction for a given input. In contrast, this paper embraces the ambiguity of mesh reconstruction and treats the task as an inverse problem admitting multiple feasible solutions. Our method, MHPro, first constructs a probability distribution from the monocular image and draws from it a set of feasible recovery results (i.e., multiple hypotheses). Intra-hypothesis refinement is then performed to enhance the features of each hypothesis independently. Finally, the multi-hypothesis features are aggregated through inter-hypothesis communication to recover the final 3D human mesh. The effectiveness of our method is validated on two benchmark datasets, Human3.6M and 3DPW, where experimental results show that it achieves state-of-the-art performance and recovers more accurate human meshes. Our results confirm the importance of intra-hypothesis refinement and inter-hypothesis communication in probabilistic modeling and demonstrate strong performance across a variety of settings. Our source code will be available at http://cic.tju.edu.cn/faculty/likun/projects/MHPro.
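The three-stage pipeline the abstract describes (sample hypotheses from a distribution, refine each independently, then let hypotheses communicate before aggregation) can be sketched in a toy form. This is a minimal illustration, not the paper's implementation: the Gaussian sampler, the residual refinement, and the similarity-based mixing weights below are all stand-in assumptions; a real model would learn these components and decode the pooled feature into SMPL parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: K hypotheses, each a D-dim pose feature.
K, D = 8, 16

# 1) Probabilistic modeling: sample K feasible pose features from a
#    Gaussian (fixed mean/std here; learned in a real model).
mu, sigma = np.zeros(D), np.ones(D)
hypotheses = mu + sigma * rng.standard_normal((K, D))

def intra_refine(h):
    """Stand-in for intra-hypothesis refinement: each hypothesis is
    enhanced independently (here, a toy residual transform)."""
    return h + 0.1 * np.tanh(h)

refined = np.stack([intra_refine(h) for h in hypotheses])

# 2) Inter-hypothesis communication: mix hypotheses with
#    attention-style weights from pairwise similarity.
scores = refined @ refined.T / np.sqrt(D)            # (K, K) similarities
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
communicated = weights @ refined                     # (K, D) mixed features

# 3) Aggregation: pool the communicated hypotheses into a single
#    feature that a decoder would map to the final 3D mesh.
final_feature = communicated.mean(axis=0)
print(final_feature.shape)  # (16,)
```

The key design point the abstract argues for is that refinement happens per hypothesis *before* any mixing, so each feasible solution is strengthened on its own terms and only then reconciled with the others.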
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62171317 and 62122058).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xuan, H., Zhang, J., Li, K. (2022). MHPro: Multi-hypothesis Probabilistic Modeling for Human Mesh Recovery. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science(), vol 13604. Springer, Cham. https://doi.org/10.1007/978-3-031-20497-5_18
Print ISBN: 978-3-031-20496-8
Online ISBN: 978-3-031-20497-5