Abstract
Multi-person 3D pose estimation from a monocular, freely moving camera in real-world scenarios remains a challenge: data with 3D ground truth is scarce, and real-world scenes usually contain self-occlusions and inter-person occlusions. To address these challenges, an unsupervised Universal Hierarchical 3D Human Pose Estimation (UH3DHPE) method is proposed that optimizes the torso and limb poses within a hierarchical framework. To handle occluded or inaccurate 2D torso keypoints, which play an important role in 3D pose initialization and subsequent inference, an effective method is proposed to estimate limb poses directly, without building upon the estimated torso pose; the torso pose can then be further refined to form the hierarchy in a bottom-up fashion. An adaptive merging strategy is proposed to determine the best hierarchy. To verify the effectiveness of the proposed scheme, a video dataset of multi-person interactions is collected with a moving camera in a Motion Capture (MoCap) environment that provides ground truth data for performance evaluation. Experimental results show that the proposed method outperforms state-of-the-art methods in multi-person, moving-camera scenarios.
Data availability
The data will be open sourced.
Code availability
No
Funding
This research was supported by the Zhejiang Provincial Science and Technology Program in China (Grant No. LQ22F020026), the National Key R&D Program of China under Grant No. 2020YFB1709402, the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. GK219909299001-028), the National Natural Science Foundation of China under Grant No. U20A20386, the Zhejiang Provincial Science and Technology Program in China under Grant No. 2021C01108, the Zhejiang Key Research and Development Program under Grant No. 2020C01050, and the National Natural Science Foundation of China under Grant No. 61772163.
Ethics declarations
Conflicts of interest/competing interests
Not applicable
About this article
Cite this article
Gu, R., Jiang, Z., Wang, G. et al. Unsupervised universal hierarchical multi-person 3D pose estimation for natural scenes. Multimed Tools Appl 81, 32883–32906 (2022). https://doi.org/10.1007/s11042-022-13079-5
DOI: https://doi.org/10.1007/s11042-022-13079-5