Multi-Person Absolute 3D Pose and Shape Estimation from Video

Zhang, Kaifu; Li, Yihui; Guan, Yisheng; Xi, Ning

doi:10.1007/978-3-030-89134-3_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13015)

Included in the following conference series:

International Conference on Intelligent Robotics and Applications

Abstract

Recovering the absolute 3D pose and shape of multiple people from video is challenging because of inherent depth and scale ambiguity and motion blur. Resolving this ambiguity requires aggregating temporal information, the relationships between people, and environmental cues. Although many methods have made progress in 3D pose estimation, most of them cannot produce accurate and natural motion sequences with absolute scale. In this paper, we propose a new framework composed of human tracking, root-relative human mesh estimation, and root depth estimation models; it adopts a temporal network architecture, a self-attention mechanism, and adversarial training. Experiments show that the method achieves good performance on in-the-wild datasets.
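
The abstract does not state how the root depth estimate and the mesh estimate expressed relative to the root joint are combined. A minimal sketch of one common way to do this, assuming a pinhole camera with known intrinsics, is to back-project the root pixel at its estimated depth and then add the per-joint offsets. The function names (`back_project`, `to_absolute`), the intrinsics tuple, and the sample values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Pinhole back-projection of pixel (u, v) at a given depth into camera space."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_absolute(root_pixel: Tuple[float, float],
                root_depth: float,
                relative_joints: Sequence[Point3D],
                intrinsics: Tuple[float, float, float, float]) -> List[Point3D]:
    """Shift joints given relative to the root by the root's camera-space position.

    Assumption (not from the paper): relative_joints are metric offsets from the
    root joint, and intrinsics is (fx, fy, cx, cy) in pixels.
    """
    fx, fy, cx, cy = intrinsics
    rx, ry, rz = back_project(root_pixel[0], root_pixel[1], root_depth,
                              fx, fy, cx, cy)
    return [(rx + dx, ry + dy, rz + dz) for dx, dy, dz in relative_joints]


if __name__ == "__main__":
    # Hypothetical values: root seen at pixel (600, 500), 4 m from the camera.
    intrinsics = (1000.0, 1000.0, 500.0, 500.0)
    offsets = [(0.0, 0.0, 0.0), (0.0, -0.5, 0.5)]
    for joint in to_absolute((600.0, 500.0), 4.0, offsets, intrinsics):
        print(joint)
```

Applying this per tracked person would place every person in the same camera coordinate frame, which is what an absolute (rather than root-centered) estimate requires.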

Author information

Authors and Affiliations

Biomimetic and Intelligent Robotics Lab (BIRL), Guangdong University of Technology, Guangzhou, 510006, China
Kaifu Zhang, Yihui Li, Yisheng Guan & Ning Xi
Department of Industrial and Manufacturing Systems Engineering, The University of Hong Kong, Pok Fu Lam, HK SAR, China
Kaifu Zhang, Yihui Li, Yisheng Guan & Ning Xi

Corresponding author

Correspondence to Yisheng Guan.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Xin-Jun Liu
Tsinghua University, Beijing, China
Zhenguo Nie
Beihang University, Beijing, China
Jingjun Yu
Tsinghua University, Beijing, China
Fugui Xie
Shandong University, Shandong, China
Rui Song

About this paper

Cite this paper

Zhang, K., Li, Y., Guan, Y., Xi, N. (2021). Multi-Person Absolute 3D Pose and Shape Estimation from Video. In: Liu, X.-J., Nie, Z., Yu, J., Xie, F., Song, R. (eds) Intelligent Robotics and Applications. ICIRA 2021. Lecture Notes in Computer Science, vol 13015. Springer, Cham. https://doi.org/10.1007/978-3-030-89134-3_18

DOI: https://doi.org/10.1007/978-3-030-89134-3_18
Published: 18 October 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89133-6
Online ISBN: 978-3-030-89134-3
eBook Packages: Computer Science, Computer Science (R0)
