Abstract
Deep end-to-end representation learning for lifting 2D poses to 3D is a common yet challenging approach to monocular human pose estimation in computer vision. Current methods, however, still produce 3D keypoints that are inconsistent with the actual joint positions. An effective strategy for this problem is to train the 2D-to-3D network on 3D human poses paired with their corresponding 2D projections. On this basis, we build a cascaded monocular 3D human pose estimation network, which adopts a hierarchical supervision network and uses the proposed composite residual module (CRM) and enhanced fusion module (EFM) as its main components. In the cascaded network, CRMs are stacked to form the cascaded modules; compared with the traditional residual module, the proposed CRM expands the information flow channels. The proposed EFMs are placed alternately with the cascaded modules to address the reduced accuracy and low robustness caused by multi-level cascading. We evaluate the proposed network on the standard Human3.6M and MPI-INF-3DHP benchmark datasets, comparing it with six algorithms in the fully supervised setting and with five algorithms in the weakly supervised setting. With the mean per joint position error (MPJPE), in millimeters, as the evaluation metric, the proposed network obtains the best results among the compared methods.
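The MPJPE metric named above is the average Euclidean distance between each estimated joint and its ground-truth position. The following is a minimal sketch of that definition, not the authors' evaluation code: the function name `mpjpe`, the three-joint pose, and its coordinates are hypothetical, and the root-joint alignment that benchmark protocols usually apply before measuring is omitted.

```python
import math
from typing import Sequence

Joint = Sequence[float]  # one joint as (x, y, z), assumed to be in millimeters


def mpjpe(predicted: Sequence[Joint], target: Sequence[Joint]) -> float:
    """Mean Euclidean distance between matching joints, in the input units."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Per-joint distance, averaged over all joints of the pose.
    return sum(math.dist(p, t) for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical three-joint pose (illustrative values, not from any dataset).
ground_truth = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (100.0, 200.0, 0.0)]
estimate = [(0.0, 0.0, 0.0), (103.0, 4.0, 0.0), (100.0, 200.0, 10.0)]

error_mm = mpjpe(estimate, ground_truth)
print(f"MPJPE: {error_mm:.1f} mm")  # per-joint errors 0, 5, 10 -> MPJPE: 5.0 mm
```

A lower value means the estimated joints lie closer to the ground truth. A dataset-level score would average this quantity over all evaluated poses.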
Cite this article
Gao, Bk., Zhang, Zx., Wu, Cn. et al. Staged cascaded network for monocular 3D human pose estimation. Appl Intell 53, 1021–1029 (2023). https://doi.org/10.1007/s10489-022-03516-1
DOI: https://doi.org/10.1007/s10489-022-03516-1