Unsupervised universal hierarchical multi-person 3D pose estimation for natural scenes

Multimedia Tools and Applications

Abstract

Multi-person 3D pose estimation from a freely moving monocular camera in real-world scenarios remains a challenge: data with 3D ground truth are scarce, and real-world scenes usually contain self-occlusions and inter-person occlusions. To address these challenges, an unsupervised Universal Hierarchical 3D Human Pose Estimation (UH3DHPE) method is proposed that optimizes the torso and limb poses within a hierarchical framework. The 2D torso keypoints play an important role in 3D pose initialization and subsequent inference, so to handle cases where they are occluded or inaccurate, an effective method is proposed that estimates limb poses directly, without building upon the estimated torso pose. The torso pose can then be refined afterwards, forming the hierarchy in a bottom-up fashion. An adaptive merging strategy is proposed to determine the best hierarchy. To verify the effectiveness of the proposed scheme, a video dataset of multi-person interactions was collected with a moving camera in a Motion Capture (MoCap) environment that provides ground truth data for performance evaluation. Experimental results show that the proposed method outperforms state-of-the-art methods in multi-person moving-camera scenarios.
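The abstract does not state how the adaptive merging strategy decides between a torso-first and a limb-first hierarchy. The sketch below is only an illustration of one plausible criterion, not the authors' method: it assumes a simple pinhole camera and picks the candidate 3D pose whose confidence-weighted reprojection error against the detected 2D keypoints is lowest. All names (`project`, `reprojection_error`, `choose_hierarchy`, the `focal` parameter, and the joint labels) are hypothetical.

```python
from typing import Dict, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def project(point: Point3D, focal: float) -> Point2D:
    """Pinhole projection of a camera-space 3D point (assumed camera model)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def reprojection_error(
    pose3d: Dict[str, Point3D],
    keypoints2d: Dict[str, Point2D],
    confidence: Dict[str, float],
    focal: float,
) -> float:
    """Confidence-weighted mean pixel distance between projected and detected joints."""
    total, weight = 0.0, 0.0
    for joint, point in pose3d.items():
        u, v = project(point, focal)
        ku, kv = keypoints2d[joint]
        c = confidence[joint]
        total += c * ((u - ku) ** 2 + (v - kv) ** 2) ** 0.5
        weight += c
    # With no trusted joints the candidate cannot be scored, so rank it last.
    return total / weight if weight > 0 else float("inf")


def choose_hierarchy(
    candidates: Dict[str, Dict[str, Point3D]],
    keypoints2d: Dict[str, Point2D],
    confidence: Dict[str, float],
    focal: float,
) -> str:
    """Return the name of the candidate hierarchy with the lowest error (assumed criterion)."""
    return min(
        candidates,
        key=lambda name: reprojection_error(
            candidates[name], keypoints2d, confidence, focal
        ),
    )


# Hypothetical two-joint example: the torso-first candidate misplaces the hip.
keypoints2d = {"hip": (0.0, 0.0), "knee": (0.0, 100.0)}
confidence = {"hip": 1.0, "knee": 1.0}
candidates = {
    "torso_first": {"hip": (0.2, 0.0, 2.0), "knee": (0.0, 2.0, 2.0)},
    "limb_first": {"hip": (0.0, 0.0, 2.0), "knee": (0.0, 2.0, 2.0)},
}
best = choose_hierarchy(candidates, keypoints2d, confidence, focal=100.0)
```

Weighting by 2D detection confidence means an occluded torso keypoint contributes little to the score, which loosely mirrors the motivation given above for not always building on the torso estimate.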

Data availability

The data will be open sourced.

Code availability

No

Funding

This research was supported by the Zhejiang Provincial Science and Technology Program in China (Grant No. LQ22F020026), the National Key R&D Program of China (Grant No. 2020YFB1709402), the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. GK219909299001-028), the National Natural Science Foundation of China (Grant No. U20A20386), the Zhejiang Provincial Science and Technology Program in China (Grant No. 2021C01108), the Zhejiang Key Research and Development Program (Grant No. 2020C01050), and the National Natural Science Foundation of China (Grant No. 61772163).

Author information

Corresponding author

Correspondence to Gaoang Wang.

Ethics declarations

Conflicts of interest/competing interests

Not applicable

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

ESM 1

(MP4 21,668 kb)

ESM 2

(MP4 21,668 kb)

About this article

Cite this article

Gu, R., Jiang, Z., Wang, G. et al. Unsupervised universal hierarchical multi-person 3D pose estimation for natural scenes. Multimed Tools Appl 81, 32883–32906 (2022). https://doi.org/10.1007/s11042-022-13079-5

  • DOI: https://doi.org/10.1007/s11042-022-13079-5
