Abstract
It is a challenging problem to recover the 3D absolute pose and shape of multiple person from video because of the inherent depth, scale and motion blur in the video. To solve this ambiguity, we need to aggregate temporal information, relationship between people and environmental factors, etc. Although many methods have made progress in 3D pose estimation, most of them can not produce accurate and natural motion sequences with absolute scale. In this paper, we propose a new framework, which is composed of human tracking, root-related human mesh estimation and root depth estimation model, adopts temporal network architecture, self-attention mechanism and adversarial training. The experiments show that the method has achieved good performance in in-the-wild datasets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Arnab, A., Doersch, C., Zisserman, A.: Exploiting temporal context for 3D human pose estimation in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3395–3404 (2019)
Dabral, R., Mundhada, A., Kusupati, U., Afaque, S., Jain, A.: Structure-aware and temporally coherent 3D human pose estimation. arXiv preprint arXiv:1711.09250 3(4), 6 (2017)
Doersch, C., Zisserman, A.: Sim2Real transfer learning for 3D human pose estimation: motion to the rescue. arXiv preprint arXiv:1907.02499 (2019)
Grauman, K., Shakhnarovich, G., Darrell, T.: Inferring 3D structure with a statistical image-based shape model. In: ICCV, vol. 3, p. 641 (2003)
He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7122–7131 (2018)
Kanazawa, A., Zhang, J.Y., Felsen, P., Malik, J.: Learning 3D human dynamics from video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5614–5623 (2019)
Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2252–2261 (2019)
Kolotouros, N., Pavlakos, G., Daniilidis, K.: Convolutional mesh regression for single-image human shape reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4501–4510 (2019)
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. (TOG) 34(6), 1–16 (2015)
Mahmood, N., Ghorbani, N., Troje, N.F., Pons-Moll, G., Black, M.J.: AMASS: archive of motion capture as surface shapes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5442–5451 (2019)
von Marcard, T., Henschel, R., Black, M.J., Rosenhahn, B., Pons-Moll, G.: Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 601–617 (2018)
Mehta, D., et al.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 2017 international conference on 3D vision (3DV), pp. 506–516. IEEE (2017)
Mehta, D., et al.: XNect: real-time multi-person 3D human pose estimation with a single RGB camera. arXiv preprint arXiv:1907.00837 (2019)
Mehta, D., et al.: Single-shot multi-person 3D pose estimation from monocular RGB. In: 2018 International Conference on 3D Vision (3DV), pp. 120–130. IEEE (2018)
Moon, G., Chang, J.Y., Lee, K.M.: Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10133–10142 (2019)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Pavlakos, G., Zhu, L., Zhou, X., Daniilidis, K.: Learning to estimate 3D human pose and shape from a single color image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 459–468 (2018)
Rogez, G., Weinzaepfel, P., Schmid, C.: LCR-Net: localization-classification-regression for human pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3433–3441 (2017)
Rogez, G., Weinzaepfel, P., Schmid, C.: LCR-Net++: multi-person 2D and 3D pose detection in natural images. IEEE Trans. Pattern Anal. Mach. Intell. 42(5), 1146–1161 (2019)
Sapp, B., Taskar, B.: MODEC: multimodal decomposable models for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3681 (2013)
Sun, Y., Ye, Y., Liu, W., Gao, W., Fu, Y., Mei, T.: Human mesh recovery from monocular images via a skeleton-disentangled representation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5349–5358 (2019)
Varol, G., et al.: BodyNet: volumetric inference of 3D human body shapes. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 20–36 (2018)
Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: FairMOT: on the fairness of detection and re-identification in multiple object tracking. arXiv e-prints pp. arXiv-2004 (2020)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, K., Li, Y., Guan, Y., Xi, N. (2021). Multi-Person Absolute 3D Pose and Shape Estimation from Video. In: Liu, XJ., Nie, Z., Yu, J., Xie, F., Song, R. (eds) Intelligent Robotics and Applications. ICIRA 2021. Lecture Notes in Computer Science(), vol 13015. Springer, Cham. https://doi.org/10.1007/978-3-030-89134-3_18
Download citation
DOI: https://doi.org/10.1007/978-3-030-89134-3_18
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-89133-6
Online ISBN: 978-3-030-89134-3
eBook Packages: Computer ScienceComputer Science (R0)