Abstract
Although human pose estimation has achieved great success, ambiguity in joint prediction remains poorly resolved, especially in complex situations such as crowded scenes, occlusions, and unusual poses. We attribute this to the noise introduced when multi-level features are combined by simply adding them at each position. To alleviate this problem, we propose a gated multi-scale feature fusion (GMSFF) structure. This module selectively imports high-level features to make up for the semantic information missing from low-resolution feature maps. Inspired by the prior knowledge that the positions of joints can inform one another, we also propose a new fine-tuning strategy for pose estimation, the spatial mutual information complementary module (SMICM). It helps the model adjust the position of the current joint by capturing the information contained in other joints, and it adds only a small computational cost. We evaluate the proposed method on four datasets: the MPII Human Pose Dataset (MPII), the COCO keypoint detection dataset (COCO), the Occluded Human Dataset (OCHuman), and the CrowdPose dataset. The experimental results show that the improvement grows as the level of occlusion and crowding in the dataset increases. In particular, the method gains 2.2 AP on the OCHuman dataset. In addition, our modules are plug-and-play.
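
The contrast the abstract draws between plain addition and gated fusion can be illustrated with a minimal NumPy sketch. This is not the GMSFF module itself, whose design is not given in this section. The function names, the per-position sigmoid gate, and the 1x1-convolution-style parameters `gate_weight` and `gate_bias` are assumptions made for illustration, and both feature maps are assumed to have already been resized to the same shape.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, maps gate logits into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def additive_fusion(base_feat, high_level_feat):
    """Baseline: add the two feature maps at every position."""
    return base_feat + high_level_feat


def gated_fusion(base_feat, high_level_feat, gate_weight, gate_bias):
    """Hypothetical gated fusion of two (C, H, W) feature maps.

    A single gate value per spatial position decides how much of the
    high-level feature is let through. The gate is computed like a 1x1
    convolution over the concatenated channels (an assumed design).
    """
    # Stack both inputs along the channel axis: shape (2C, H, W).
    stacked = np.concatenate([base_feat, high_level_feat], axis=0)
    # Weighted sum over channels gives one logit per position: shape (H, W).
    logits = np.tensordot(gate_weight, stacked, axes=([0], [0])) + gate_bias
    gate = sigmoid(logits)
    # Broadcast the gate over channels and admit the high-level feature.
    fused = base_feat + gate[None, :, :] * high_level_feat
    return fused, gate


# Example usage with random features (C=4, H=W=8) and random gate weights.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 8, 8))
high = rng.normal(size=(4, 8, 8))
weights = rng.normal(size=8)  # one weight per concatenated channel (2C)
fused, gate = gated_fusion(base, high, weights, gate_bias=0.0)
```

When the gate saturates at 1 everywhere, this sketch reduces to the additive baseline; when it saturates at 0, the high-level feature is ignored. In a trained network the gate parameters would be learned rather than sampled at random.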
Availability of data and material
The datasets analyzed during the current study are publicly available: MPII (http://human-pose.mpi-inf.mpg.de), COCO (https://cocodataset.org), CrowdPose (https://github.com/Jeff-sjtu/CrowdPose), and OCHuman (https://github.com/liruilong940607/OCHumanApi).
Funding
Not applicable.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Code availability
Contact the corresponding author if necessary.
About this article
Cite this article
Zhao, X., Guo, C. & Zou, Q. Human pose estimation with gated multi-scale feature fusion and spatial mutual information. Vis Comput 39, 119–137 (2023). https://doi.org/10.1007/s00371-021-02317-w
DOI: https://doi.org/10.1007/s00371-021-02317-w