Abstract
Although existing image-based methods for 3D human mesh reconstruction achieve remarkable accuracy, capturing smooth human motion from monocular video remains a significant challenge. Recent video-based methods for human mesh reconstruction tend to build increasingly complex networks to capture the temporal dynamics of human motion, resulting in large parameter counts that limit their practical application. To address this issue, we propose EGTR, an Efficient Graph Transformer network for Reconstructing 3D human mesh from monocular video. Specifically, we present a temporal redundancy removal module that uses 1D convolution to eliminate redundant information across video frames, and a spatial-temporal fusion module that combines a Modulated GCN with a transformer framework to capture human motion. Our method achieves better accuracy than the state-of-the-art video-based method TCMR on the 3DPW, Human3.6M, and MPI-INF-3DHP datasets while using only 8.7% of its parameters, demonstrating the effectiveness of our method for practical applications.
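As a rough illustration of the two components named in the abstract, the sketch below shows what a 1D-convolutional temporal redundancy removal module and a Modulated-GCN-plus-transformer fusion block might look like in PyTorch. This is not the authors' released code: all module names, feature dimensions, strides, and head counts here are assumptions made for the example.

```python
import torch
import torch.nn as nn


class TemporalRedundancyRemoval(nn.Module):
    """Hypothetical sketch: a strided 1D convolution over the time axis
    that downsamples per-frame features, discarding redundant
    information shared by adjacent frames."""

    def __init__(self, dim, stride=2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, stride=stride, padding=1)

    def forward(self, x):             # x: (batch, frames, dim)
        x = x.transpose(1, 2)         # -> (batch, dim, frames)
        x = self.conv(x)              # halve the frame count (stride=2)
        return x.transpose(1, 2)      # -> (batch, frames', dim)


class ModulatedGCN(nn.Module):
    """Hypothetical sketch of a modulated graph convolution in the
    spirit of Zou & Tang (2021): a shared feature transform is
    modulated per joint, then aggregated over a learnable adjacency."""

    def __init__(self, num_joints, dim):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_joints))        # learnable graph
        self.weight = nn.Parameter(torch.randn(dim, dim) * dim ** -0.5)
        self.modulation = nn.Parameter(torch.ones(num_joints, dim))

    def forward(self, x):             # x: (batch, joints, dim)
        h = (x @ self.weight) * self.modulation               # per-joint modulation
        return torch.softmax(self.adj, dim=-1) @ h            # graph aggregation


class SpatialTemporalFusion(nn.Module):
    """Hypothetical fusion block: Modulated GCN over the joints of each
    frame, followed by a transformer encoder layer over the frames."""

    def __init__(self, num_joints, dim, nhead=4):
        super().__init__()
        self.gcn = ModulatedGCN(num_joints, dim)
        self.temporal = nn.TransformerEncoderLayer(
            d_model=num_joints * dim, nhead=nhead, batch_first=True)

    def forward(self, x):             # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape
        x = self.gcn(x.reshape(b * t, j, d))                  # spatial pass
        x = self.temporal(x.reshape(b, t, j * d))             # temporal pass
        return x.reshape(b, t, j, d)
```

Under these assumptions, a 16-frame clip of 24-joint features is first reduced to 8 frames by the redundancy removal module, after which the fusion block mixes information within each frame (graph convolution over joints) and across frames (self-attention over time).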
Acknowledgement
This work was funded by the National Natural Science Foundation of China (No. 62073004), the National Key R&D Program of China (No. 2020AAA0108904), and the Shenzhen Fundamental Research Program (No. GXWD20201231165807007-20200807164903001).
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Tang, T., You, Y., Wang, T., Liu, H. (2024). An Efficient Graph Transformer Network for Video-Based Human Mesh Reconstruction. In: Fang, L., Pei, J., Zhai, G., Wang, R. (eds) Artificial Intelligence. CICAI 2023. Lecture Notes in Computer Science, vol. 14473. Springer, Singapore. https://doi.org/10.1007/978-981-99-8850-1_17
Print ISBN: 978-981-99-8849-5
Online ISBN: 978-981-99-8850-1