Abstract
3D human recovery has attracted great attention and shows great potential in applications such as games and movies. To address the challenges of occlusion and depth ambiguity in 3D human reconstruction, transformer encoder architectures have made good progress in learning the relationships between different parts of the human body. Nevertheless, the modality gap between the image tokens and the vertex-joint tokens fed to the model still limits the reconstruction quality of the 3D human mesh. To overcome this limitation, we propose a module based on a multimodal cross-feature fusion mechanism that directly fuses 2D image features with 3D spatial coordinates to reconstruct a better human mesh. Our approach employs a large-kernel attention strategy to improve the modeling of long-range spatial relationships in image features. We also design a token-shift module for joint and vertex tokens to learn the interactions between vertices. Quantitative and qualitative experiments on large-scale human datasets such as 3DPW and Human3.6M show that our method achieves excellent reconstruction accuracy.
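No code accompanies this abstract; the PyTorch sketch below is only a minimal illustration of two of the ingredients it names: a large-kernel attention block applied to image feature maps (following the depthwise + dilated depthwise + pointwise decomposition popularized by the Visual Attention Network) and a token-shift operation over joint/vertex tokens. All module names, tensor shapes, kernel sizes, and the shift ratio are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of large-kernel attention and joint/vertex token shifting.
# Shapes, kernel sizes, and the shift ratio are assumptions for illustration.
import torch
import torch.nn as nn


class LargeKernelAttention(nn.Module):
    """Large-kernel attention in the style of the Visual Attention Network:
    depthwise conv + dilated depthwise conv + 1x1 conv, used as a spatial gate."""

    def __init__(self, dim: int):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        self.dw_dilated = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                    dilation=3, groups=dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) image feature map; the attention map gates the input,
        # injecting long-range spatial context at low cost.
        attn = self.pw(self.dw_dilated(self.dw(x)))
        return x * attn


def token_shift(tokens: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels between neighbouring joint/vertex tokens so
    adjacent tokens exchange information before self-attention. tokens: (B, N, C)."""
    _, _, c = tokens.shape
    k = int(c * shift_ratio)
    out = tokens.clone()
    out[:, 1:, :k] = tokens[:, :-1, :k]            # shift one channel slice forward
    out[:, :-1, k:2 * k] = tokens[:, 1:, k:2 * k]  # shift another slice backward
    return out


if __name__ == "__main__":
    feat = torch.randn(2, 64, 56, 56)        # image features from a CNN backbone
    joints = torch.randn(2, 14 + 431, 64)    # assumed coarse joint + vertex tokens
    print(LargeKernelAttention(64)(feat).shape)  # torch.Size([2, 64, 56, 56])
    print(token_shift(joints).shape)             # torch.Size([2, 445, 64])
```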
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Jiang, Y., Wang, S., Sun, M., Kou, D., Xie, Q., Zhang, L. (2025). Multimodal Token Fusion and Optimization for 3D Human Mesh Reconstruction with Transformers. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15036. Springer, Singapore. https://doi.org/10.1007/978-981-97-8508-7_41
DOI: https://doi.org/10.1007/978-981-97-8508-7_41
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8507-0
Online ISBN: 978-981-97-8508-7
eBook Packages: Computer Science, Computer Science (R0)