Skip to main content
Log in

Human pose estimation with gated multi-scale feature fusion and spatial mutual information

  • Original article
  • Published:
The Visual Computer Aims and scope Submit manuscript

Abstract

Although human pose estimation has achieved great success, the ambiguity of joint prediction has not been well resolved, especially in complex situations (crowded scenes, occlusions, and unnormal poses). We think that is caused by the noisy information introduced by combining multi-level features by simply adding features at each position. To alleviate this problem, we propose a new structure of gated multi-scale feature fusion (GMSFF). This module aims to selectively import high-level features to make up for the missing semantic information of low-resolution feature maps. Inspired by the prior knowledge that the position information of joints can refer to each other, we propose a new fine-tuning strategy for pose estimation—spatial mutual information complementary module (SMICM). It can assist the model in better adjusting the current joint’s position by capturing the information contained in other joints and only adds a little computational cost. We evaluated our proposed method on four datasets: MPII Human Pose Dataset (MPII), COCO keypoint detection Dataset (COCO), Occluded Human Dataset (OCHuman), and CrowdPose Dataset. The experimental results show that with the deepening of the occlusion and crowding level of the datasets, the improvement becomes more and more obvious. In particular, a performance improvement of 2.2 AP was obtained on the OCHuman dataset. In addition, our modules are plug-and-play.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14

Similar content being viewed by others

Availability of data and material

The datasets analyzed during the current study are available in the MPII (http://human-pose.mpi-inf.mpg.de), COCO (https://cocodataset.org), CrowdPose (https://github.com/Jeff-sjtu/CrowdPose) and OCHuman (https://github.com/liruilong940607/OCHumanApi).

References

  1. Vidanpathirana, M., Sudasingha, I., Vidanapathirana, J., Kanchana, P., Perera, I.: Tracking and frame-rate enhancement for real-time 2D human pose estimation. Vis. Comput. 36, 1501–1519 (2020)

    Article  Google Scholar 

  2. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: European Conference on Computer Vision, pp. 472–487 (2018)

  3. Singh, V.K., Nevatia, R.: Simultaneous tracking and action recognition for single actor human actions. Vis. Comput. 27, 1115–1123 (2011)

    Article  Google Scholar 

  4. Agahian, S., Negin, F., Köse, C.: Improving bag-of-poses with semi-temporal pose descriptors for skeleton-based action recognition. Vis. Comput. 35, 591–607 (2019)

    Article  Google Scholar 

  5. Wu, J., Hu, D., Xiang, F., Yuan, X., Su, J.: 3D human pose estimation by depth map. Vis. Comput. 36, 1401–1410 (2020)

    Article  Google Scholar 

  6. Liu, X., Yin, J., Liu, H., Yin, Y.: PISEP2: pseudo-image sequence evolution-based 3D pose prediction. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02135-0

    Article  Google Scholar 

  7. Zhao, H., Tian, M., Sun, S., Shao, J., Yan, J., Yi, S., Wang, X., Tang, X.: Spindle Net: person re-identification with human body region guided feature decomposition and fusion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 907–915 (2017)

  8. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

    Article  Google Scholar 

  9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)

    Google Scholar 

  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  11. Newell, A., Yang, K., Deng, J. Stacked hourglass networks for human pose estimation. In: Lecture Notes in Computer Science European Conference on Computer Vision. Springer, Cham, pp. 483–499 (2016)

  12. Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112 (2018)

  13. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5686–5696 (2019)

  14. Su, K., Yu, D., Xu, Z., Geng, X., Wang, C.: Multi-person pose estimation with enhanced channel-wise and spatial information. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5667–5675 (2019)

  15. Verma, P., Srivastava, R.: Two-stage multi-view deep network for 3D human pose reconstruction using images and its 2D joint heatmaps through enhanced stack-hourglass approach. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02120-7

    Article  Google Scholar 

  16. Yang, Q., Shi, W., Chen, J., Tang, Y.: Localization of hard joints in human pose estimation based on residual down-sampling and attention mechanism. Vis. Comput. (2021). https://doi.org/10.1007/s00371-021-02122-5

    Article  Google Scholar 

  17. Zhao, L., Wang, N.N., Gong, C., Yang, J., Gao, X.B.: Estimating human pose efficiently by parallel pyramid networks. IEEE Trans. Image Process. 30, 6785–6800 (2021)

    Article  Google Scholar 

  18. Zhao, L., Xu, J., Gong, C., Yang, J., Zuo, W.M., Gao, X.B.: Learning to acquire the quality of human pose estimation. IEEE Trans. Circuits Syst. Video Technol. 31, 1555–1568 (2021)

    Article  Google Scholar 

  19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010 (2017)

  20. Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3686–3693 (2014)

  21. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: European Conference on Computer Vision, pp. 740–755 (2014)

  22. Zhang, S.H., Li, R., Dong, X., Rosin, P., Cai, Z., Han, X., Yang, D., Huang, H., Hu, S.M.: Pose2Seg: detection free human instance segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 889–898 (2019)

  23. Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.S., Lu, C.: CrowdPose: efficient crowded scenes pose estimation and a new benchmark. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10863–10872 (2019)

  24. Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648–656 (2015)

  25. Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1290–1299 (2017)

  26. Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A.L., Wang, X.: Multi-context attention for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5669–5678 (2017)

  27. Ke, L., Chang, M.C., Qi, H., Lyu, S.: Multi-scale structure-aware network for human pose estimation. In: Proceedings of the European Conference on Computer Vision, pp. 731–746 (2018)

  28. Zhang, H., Ouyang, H., Liu, S., Qi, X., Shen, X., Yang, R., Jia, J.: Human pose estimation with spatial contextual information (2019). arXiv:190101760

  29. Li, W., Wang, Z., Yin, B., Peng, Q., Du, Y., Xiao, T., Yu, G., Lu, H., Wei, Y., Sun, J.: Rethinking on multi-stage networks for human pose estimation (2019). arXiv:190100148

  30. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660 (2014)

  31. Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4732 (2016)

  32. Chu, X., Ouyang, W., Li, H., Wang, X.: Structured feature learning for pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4715–4723 (2016)

  33. Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards accurate multi-person pose estimation in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3711–3719 (2017)

  34. Xia, F., Wang, P., Chen, X., Yuille, A.L.: Joint multi-person pose estimation and semantic part segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6080–6089 (2017)

  35. Amirul, Islam. M., Rochan, M., Bruce, N.D., Wang, Y.: Gated feedback refinement network for dense image labeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4877–4885 (2017)

  36. Zhang, L., Dai, J., Lu, H., He, Y., Wang, G.: A bi-directional message passing model for salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1741–1750 (2018)

  37. Li, X., Zhao, H., Han, L., Tong, Y., Yang, K.: GFF: gated fully fusion for semantic segmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11418–11425 (2019)

  38. Zhang, F., Zhu, X.T., Ye, M.: Fast human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3512–3521 (2019)

  39. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)

  40. Tang, W., Wu, Y.: Does learning specific features for related parts help human pose estimation? In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1107–1116 (2019)

  41. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. Comput.Sci. (2014)

  42. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)

  43. Zhou, L., Chen, Y., Gao, Y., Wang, J., Lu, H.: Occlusion-aware siamese network for human pose estimation. In: European Conference on Computer Vision, pp. 396–412 (2020)

  44. Tang, W., Yu, P., Wu, Y.: Deeply learned compositional models for human pose estimation. In: Proceedings of the European Conference on Computer Vision, pp. 197–214 (2018)

  45. Qiu, L., Zhang, X., Li, Y., Li, G., Wu, X., Xiong, Z., Han, X., Cui, S.: Peeking into occluded joints: a novel framework for crowd pose estimation. In: European Conference on Computer Vision, pp. 488–504 (2020)

  46. Chen, Y., Shen, C., Wei, X.S., Liu, L., Yang, J.: Adversarial PoseNet: a structure-aware convolutional network for human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1221–1230 (2017)

Download references

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Qiang Zou.

Ethics declarations

Conflicts of interest

All authors declare that we have no conflict of interest.

Code availability

Contact the corresponding author if necessary.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhao, X., Guo, C. & Zou, Q. Human pose estimation with gated multi-scale feature fusion and spatial mutual information. Vis Comput 39, 119–137 (2023). https://doi.org/10.1007/s00371-021-02317-w

Download citation

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00371-021-02317-w

Keywords

Navigation