Unsupervised universal hierarchical multi-person 3D pose estimation for natural scenes

Gu, Renshu; Jiang, Zhongyu; Wang, Gaoang; McQuade, Kevin; Hwang, Jenq-Neng

doi:10.1007/s11042-022-13079-5

Published: 15 April 2022

Volume 81, pages 32883–32906, (2022)

Multimedia Tools and Applications

Renshu Gu¹, Zhongyu Jiang², Gaoang Wang³ (ORCID: orcid.org/0000-0002-8403-1538), Kevin McQuade⁴ & Jenq-Neng Hwang²

Abstract

Multi-person 3D pose estimation from a monocular, freely moving camera in real-world scenarios remains a challenge. Data with 3D ground truth are scarce, and real-world scenes usually contain self-occlusions and inter-person occlusions. To address these challenges, an unsupervised Universal Hierarchical 3D Human Pose Estimation (UH3DHPE) method is proposed that optimizes the torso and limb poses within a hierarchical framework. The 2D torso keypoints play an important role in 3D pose initialization and subsequent inference. To handle the case where they are occluded or inaccurate, an effective method is proposed that estimates limb poses directly, without building on the estimated torso pose; the torso pose can then be refined to form the hierarchy in a bottom-up fashion. An adaptive merging strategy is proposed to determine the best hierarchy. To verify the effectiveness of the proposed scheme, a video dataset of multi-person interactions was collected with a moving camera in a Motion Capture (MoCap) environment that provides ground-truth data for performance evaluation. Experimental results show that the proposed method outperforms state-of-the-art methods in multi-person moving-camera scenarios.
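The idea of choosing between candidate hierarchies can be illustrated with a toy selection rule. The abstract does not state the criterion used by the adaptive merging strategy, so the sketch below assumes one: each candidate 3D pose is projected with a pinhole camera, and the candidate with the lowest confidence-weighted reprojection error against the detected 2D keypoints is kept. The function names, the candidate labels `torso_first` and `limbs_first`, and the camera parameters are all hypothetical and are not taken from the article.

```python
import numpy as np


def project(points_3d, focal, center):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    z = points_3d[:, 2:3]  # keep shape (N, 1) so the division broadcasts
    return focal * points_3d[:, :2] / z + center


def reprojection_error(points_3d, keypoints_2d, confidence, focal, center):
    """Confidence-weighted mean pixel distance between projected and detected joints."""
    residual = np.linalg.norm(project(points_3d, focal, center) - keypoints_2d, axis=1)
    return float(np.sum(confidence * residual) / np.sum(confidence))


def select_hierarchy(candidates, keypoints_2d, confidence, focal, center):
    """Assumed criterion: keep the candidate pose with the lowest reprojection error."""
    errors = {
        name: reprojection_error(pose, keypoints_2d, confidence, focal, center)
        for name, pose in candidates.items()
    }
    return min(errors, key=errors.get), errors


if __name__ == "__main__":
    focal, center = 1000.0, np.array([640.0, 360.0])  # hypothetical intrinsics
    true_pose = np.array([[0.0, 0.0, 4.0], [0.2, -0.5, 4.1], [-0.2, 0.5, 3.9]])
    keypoints = project(true_pose, focal, center)      # synthetic 2D detections
    confidence = np.array([1.0, 0.5, 0.8])             # per-joint detector scores
    candidates = {
        "torso_first": true_pose + np.array([0.1, 0.0, 0.0]),  # biased estimate
        "limbs_first": true_pose,
    }
    best, errors = select_hierarchy(candidates, keypoints, confidence, focal, center)
    print(best, errors)
```

Weighting by detector confidence means that occluded or unreliable joints contribute less to the comparison, which is one plausible way to keep a poorly detected torso from dominating the choice.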

Data availability

The data will be open sourced.

Code availability

No

Funding

This research was supported by the Zhejiang Provincial Science and Technology Program in China (Grant No. LQ22F020026), the National Key R&D Program of China (Grant No. 2020YFB1709402), the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. GK219909299001-028), the National Natural Science Foundation of China (Grant Nos. U20A20386 and 61772163), the Zhejiang Provincial Science and Technology Program in China (Grant No. 2021C01108), and the Zhejiang Key Research and Development Program (Grant No. 2020C01050).

Author information

Authors and Affiliations

Hangzhou Dianzi University, Computer and Software School, Zhejiang, Hangzhou, China
Renshu Gu
University of Washington, Electrical and Computer Engineering, Seattle, WA, USA
Zhongyu Jiang & Jenq-Neng Hwang
Zhejiang University / University of Illinois at Urbana-Champaign Institute, Haining, Zhejiang, China
Gaoang Wang
University of Washington, School of Medicine, Seattle, WA, USA
Kevin McQuade

Corresponding author

Correspondence to Gaoang Wang.

Ethics declarations

Conflicts of interest/competing interests

Not applicable

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

ESM 1

(MP4 21,668 kb)

ESM 2

(MP4 21,668 kb)

About this article

Cite this article

Gu, R., Jiang, Z., Wang, G. et al. Unsupervised universal hierarchical multi-person 3D pose estimation for natural scenes. Multimed Tools Appl 81, 32883–32906 (2022). https://doi.org/10.1007/s11042-022-13079-5

Received: 13 November 2020
Revised: 25 March 2021
Accepted: 03 April 2022
Published: 15 April 2022
Issue Date: September 2022
DOI: https://doi.org/10.1007/s11042-022-13079-5
