Abstract
It is well known that regressing the parametric representation of a human body from a single image suffers from low accuracy due to sparse use of information and error accumulation. Directly regressing vertices avoids these issues and can achieve higher accuracy, but it may produce vertex outliers. We present METRO-X, a novel method for reconstructing full-body human meshes with body pose, facial expression and hand gesture from a single image. It combines the advantages of the two approaches, achieving higher accuracy than parameter regression while yielding denser vertices and smoother shapes than vertex regression. It first detects and extracts the hands, head and whole body from a given image, then regresses the vertices of the three parts separately using METRO, and finally fits SMPL-X to the reconstructed meshes to obtain the complete parametric representation. Experimental results show that METRO-X outperforms the ExPose method, with a significant 23% improvement in body accuracy and a 35% improvement in gesture accuracy. These results demonstrate the potential of our approach for enabling various applications.
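The final step of the pipeline, fitting a parametric model to regressed vertices, can be illustrated with a deliberately simplified sketch. It assumes a toy linear model in which the flattened vertices equal a template plus a weighted sum of basis directions, and it recovers the weights by ordinary least squares. This is not SMPL-X, which is nonlinear in pose and is normally fitted by iterative optimization; the names `model_vertices`, `solve_linear` and `fit_params` and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def model_vertices(template: Sequence[float],
                   basis: Sequence[Sequence[float]],
                   params: Sequence[float]) -> Vector:
    """Toy linear model (an assumption, not SMPL-X):
    flattened vertices = template + sum_k params[k] * basis[k]."""
    out = list(template)
    for coeff, direction in zip(params, basis):
        for i, d in enumerate(direction):
            out[i] += coeff * d
    return out


def solve_linear(a: List[List[float]], b: List[float]) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("basis vectors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fit_params(template: Sequence[float],
               basis: Sequence[Sequence[float]],
               regressed: Sequence[float]) -> Vector:
    """Least-squares fit of the toy model to regressed vertices,
    solved through the normal equations (B B^T) p = B (v - template)."""
    residual = [v - t for v, t in zip(regressed, template)]
    k = len(basis)
    gram = [[sum(x * y for x, y in zip(basis[i], basis[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(x * r for x, r in zip(basis[i], residual)) for i in range(k)]
    return solve_linear(gram, rhs)


if __name__ == "__main__":
    # Two vertices flattened to six coordinates; values are hypothetical.
    template = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
    basis = [[1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0, 0.0, -1.0, 0.0]]
    regressed = model_vertices(template, basis, [0.5, -1.0])
    regressed[2] += 0.3  # simulate a vertex outlier outside the model's span
    params = fit_params(template, basis, regressed)
    print(params, model_vertices(template, basis, params))
```

In this toy setting the fitted mesh always lies in the span of the model, so a perturbation that the basis cannot express is projected away. This loosely mirrors why fitting a parametric model to regressed vertices can suppress outliers and yield a smoother shape.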
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, G., Yao, C., Zhang, H., Zeng, J., Nie, Y., Xian, C. (2024). METRO-X: Combining Vertex and Parameter Regressions for Recovering 3D Human Meshes with Full Motions. In: Sheng, B., Bi, L., Kim, J., Magnenat-Thalmann, N., Thalmann, D. (eds) Advances in Computer Graphics. CGI 2023. Lecture Notes in Computer Science, vol 14496. Springer, Cham. https://doi.org/10.1007/978-3-031-50072-5_4
DOI: https://doi.org/10.1007/978-3-031-50072-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-50071-8
Online ISBN: 978-3-031-50072-5
eBook Packages: Computer Science, Computer Science (R0)