An Efficient Graph Transformer Network for Video-Based Human Mesh Reconstruction

  • Conference paper

Artificial Intelligence (CICAI 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14473)


Abstract

Although existing image-based methods for 3D human mesh reconstruction achieve remarkable accuracy, effectively capturing smooth human motion from monocular video remains a significant challenge. Recent video-based methods for human mesh reconstruction tend to build increasingly complex networks to capture the temporal information of human motion, resulting in large numbers of parameters that limit their practical application. To address this issue, we propose EGTR, an Efficient Graph Transformer network for Reconstructing 3D human mesh from monocular video. Specifically, we present a temporal redundancy removal module that uses 1D convolution to eliminate redundant information among video frames, and a spatial-temporal fusion module that combines a Modulated GCN with a transformer framework to capture human motion. Our method achieves better accuracy than the state-of-the-art video-based method TCMR on the 3DPW, Human3.6M, and MPI-INF-3DHP datasets while using only 8.7% of its parameters, indicating the effectiveness of our method for practical applications.
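To make the two modules described in the abstract concrete, below is a minimal PyTorch-style sketch of how a temporal redundancy removal module (strided 1D convolutions over per-frame features) and a spatial-temporal fusion module (a Modulated GCN layer over joints followed by a transformer encoder over frames) might be wired together. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the authors' actual EGTR implementation.

```python
# Hedged sketch only: dimensions, layer counts, and the joint adjacency are
# assumptions for illustration; the paper's actual EGTR design may differ.
import torch
import torch.nn as nn


class TemporalRedundancyRemoval(nn.Module):
    """Compress T per-frame features with strided 1D convolutions so that
    redundant information shared by neighboring frames is discarded."""

    def __init__(self, dim=2048):
        super().__init__()
        # Two stride-2 convolutions reduce T frames to roughly T / 4.
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):              # x: (B, T, dim)
        x = x.transpose(1, 2)          # (B, dim, T) for Conv1d
        x = self.conv(x)               # (B, dim, T') with T' < T
        return x.transpose(1, 2)       # (B, T', dim)


class ModulatedGCNLayer(nn.Module):
    """Simplified Modulated GCN layer (after Zou and Tang, ICCV 2021): a
    shared linear weight is modulated per joint before graph aggregation."""

    def __init__(self, in_dim, out_dim, num_joints, adj):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.modulation = nn.Parameter(torch.ones(num_joints, out_dim))
        self.register_buffer("adj", adj)   # (J, J) normalized adjacency

    def forward(self, x):              # x: (B, J, in_dim)
        h = self.W(x) * self.modulation    # per-joint weight modulation
        return torch.einsum("jk,bkd->bjd", self.adj, h)


class SpatialTemporalFusion(nn.Module):
    """Fuse spatial structure (Modulated GCN over joints) with temporal
    context (transformer self-attention over the remaining frames)."""

    def __init__(self, dim=256, num_joints=24, adj=None, heads=4):
        super().__init__()
        # Identity adjacency is a placeholder; a real model would use the
        # normalized skeleton graph of the SMPL joints.
        adj = adj if adj is not None else torch.eye(num_joints)
        self.gcn = ModulatedGCNLayer(dim, dim, num_joints, adj)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):              # x: (B, T, J, dim)
        B, T, J, D = x.shape
        x = self.gcn(x.reshape(B * T, J, D)).reshape(B, T, J, D)
        x = x.permute(0, 2, 1, 3).reshape(B * J, T, D)  # attend over time
        x = self.temporal(x)
        return x.reshape(B, J, T, D).permute(0, 2, 1, 3)


if __name__ == "__main__":
    frames = torch.randn(2, 16, 2048)                  # 16 frames of CNN features
    compressed = TemporalRedundancyRemoval()(frames)   # (2, 4, 2048)
    tokens = torch.randn(2, 4, 24, 256)                # assumed per-joint tokens
    fused = SpatialTemporalFusion()(tokens)            # (2, 4, 24, 256)
    print(compressed.shape, fused.shape)
```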


References

  1. Tian, Y., Zhang, H., Liu, Y., Wang, L.: Recovering 3D human mesh from monocular images: a survey. arXiv preprint arXiv:2203.01923 (2022)

  2. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graphics (TOG) 34(6), 1–16 (2015)

  3. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7122–7131 (2018)

  4. Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2252–2261 (2019)

  5. Georgakis, G., Li, R., Karanam, S., Chen, T., Košecká, J., Wu, Z.: Hierarchical kinematic human mesh recovery. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 768–784. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_45

  6. Kocabas, M., Huang, C.H.P., Hilliges, O., Black, M.J.: PARE: part attention regressor for 3D human body estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11127–11137 (2021)

  7. Zhang, H., Tian, Y., Zhou, X., Ouyang, W., Liu, Y., Wang, L., Sun, Z.: PyMAF: 3D human pose and shape regression with pyramidal mesh alignment feedback loop. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11446–11456 (2021)

  8. Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3D human dynamics from video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5614–5623 (2019)

  9. Kocabas, M., Athanasiou, N., Black, M.J.: VIBE: video inference for human body pose and shape estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5253–5263 (2020)

  10. Luo, Z., Golestaneh, S.A., Kitani, K.M.: 3D human motion estimation via motion compression and refinement. In: Ishikawa, H., Liu, C.-L., Pajdla, T., Shi, J. (eds.) ACCV 2020. LNCS, vol. 12626, pp. 324–340. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-69541-5_20

  11. Choi, H., Moon, G., Chang, J.Y., Lee, K.M.: Beyond static features for temporally consistent 3D human pose and shape from a video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1964–1973 (2021)

  12. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)

  13. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: SCAPE: shape completion and animation of people. In: ACM SIGGRAPH, pp. 408–416 (2005)

  14. Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A.A., Tzionas, D.: Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10975–10985 (2019)

  15. Osman, A.A.A., Bolkart, T., Black, M.J.: STAR: sparse trained articulated human body regressor. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 598–613. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6_36

  16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016)

  17. Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5442–5451 (2019)

  18. Kolotouros, N., Pavlakos, G., Daniilidis, K.: Convolutional mesh regression for single-image human shape reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4501–4510 (2019)

  19. Choi, H., Moon, G., Lee, K.M.: Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12352, pp. 769–787. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58571-6_45

  20. Lin, K., Wang, L., Liu, Z.: Mesh Graphormer. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 12939–12948 (2021)

  21. You, Y., Liu, H., Li, X., Li, W., Wang, T., Ding, R.: Gator: graph-aware transformer with motion-disentangled regression for human mesh recovery from a 2D Pose. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023)

  22. Zou, Z., Tang, W.: Modulated graph convolutional network for 3D human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11477–11487 (2021)

  23. Wang, T., Liu, H., Ding, R., Li, W., You, Y., Li, X.: Interweaved graph and attention network for 3D human pose estimation. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5 (2023)

  24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.: Attention is all you need. In: Conference on Neural Information Processing Systems (NIPS) (2017)

  25. Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)

  26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  27. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z.: Automatic differentiation in PyTorch (2017)

  28. von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 601–617 (2018)

  29. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 36(7), 1325–1339 (2013)

  30. Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: International Conference on 3D Vision (3DV), pp. 506–516 (2017)

  31. Andriluka, M., Iqbal, U., Insafutdinov, E., Pishchulin, L., Milan, A., Gall, J.: PoseTrack: a benchmark for human pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5167–5176 (2018)

  32. Zhang, W., Zhu, M., Derpanis, K.G.: From actemes to action: a strongly-supervised representation for detailed action understanding. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2248–2255 (2013)

  33. Loper, M., Mahmood, N., Black, M.J.: MoSh: motion and shape capture from sparse markers. ACM Trans. Graphics (TOG) 33(6), 220:1–220:13 (2014)

Acknowledgement

This work was funded by the National Natural Science Foundation of China (No. 62073004), the National Key R&D Program of China (No. 2020AAA0108904), and the Shenzhen Fundamental Research Program (No. GXWD20201231165807007-20200807164903001).

Author information

Corresponding author

Correspondence to Yingxuan You.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Tang, T., You, Y., Wang, T., Liu, H. (2024). An Efficient Graph Transformer Network for Video-Based Human Mesh Reconstruction. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds.) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol. 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_17

  • DOI: https://doi.org/10.1007/978-981-99-8850-1_17

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8849-5

  • Online ISBN: 978-981-99-8850-1

  • eBook Packages: Computer Science (R0)
