Enhancing multi-scale information exchange and feature fusion for human pose estimation

Wang, Rui; Wu, Wanyu; Wang, Xiangyang

doi:10.1007/s00371-022-02623-x

Original article
Published: 18 August 2022

Volume 39, pages 4751–4765 (2023)

The Visual Computer

Abstract

Multi-scale feature fusion is an important part of modern network architectures, allowing them to extract more comprehensive information for computer vision tasks such as semantic segmentation and keypoint estimation. However, most existing multi-scale methods add fusion connections directly between layers or branches, which ignores the discrepancy in semantic information between feature maps of different resolutions and depths. Moreover, inappropriate fusion connections may lead to the loss of channel-wise and spatial information. In this paper, we propose a method that enhances and refines multi-scale feature fusion for human pose estimation by employing two attention mechanisms. Specifically, we present a novel multi-head spatial attention (MHSA) module, which models the context information of intermediate feature maps and reinforces important local features. In addition, we use position channel attention (PCA) to capture long-range dependencies while retaining important position information in the attention maps. Combining the MHSA and PCA modules, we design an enhanced multi-scale feature fusion network (EMF-HRNet) based on the high-resolution network (HRNet). EMF-HRNet yields better results through repeated multi-scale information exchange and feature fusion units. Extensive experiments on two common benchmarks, the COCO Keypoint dataset and the MPII Human Pose dataset, show that our method significantly improves the performance of state-of-the-art pose estimation methods.
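The general idea of attention-weighted multi-scale fusion can be illustrated with a minimal sketch. This is not the MHSA or PCA module, whose definitions are not given in the abstract. It only assumes a generic spatial gate (a sigmoid of the channel mean) applied to an upsampled low-resolution branch before it is added to a high-resolution branch. The function names `upsample_nearest`, `spatial_gate`, and `fuse` are hypothetical, and feature maps are plain nested lists indexed as `[channel][row][column]`.

```python
import math


def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of one 2D map (a list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def spatial_gate(channels):
    """Per-pixel weight in (0, 1): sigmoid of the mean over channels.

    Assumption: a generic spatial attention gate, not the paper's MHSA.
    """
    n = len(channels)
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [1.0 / (1.0 + math.exp(-sum(c[i][j] for c in channels) / n))
         for j in range(w)]
        for i in range(h)
    ]


def fuse(high, low, factor):
    """Add a gated, upsampled low-resolution branch to a high-resolution one.

    `high` and `low` must have the same number of channels, and `low`
    upsampled by `factor` must match the spatial size of `high`.
    """
    low_up = [upsample_nearest(c, factor) for c in low]
    gate = spatial_gate(low_up)
    return [
        [[high[k][i][j] + gate[i][j] * low_up[k][i][j]
          for j in range(len(gate[0]))]
         for i in range(len(gate))]
        for k in range(len(high))
    ]


# One channel: a 2x2 high-resolution map and a 1x1 low-resolution map.
high = [[[0.0, 0.0], [0.0, 0.0]]]
low = [[[2.0]]]
fused = fuse(high, low, factor=2)
```

In contrast to direct addition, the gate lets the network down-weight locations where the low-resolution branch carries little useful signal. In a trained model the gate would be computed by learned layers rather than by a fixed channel mean.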

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61771299.

Author information

Authors and Affiliations

School of Communication and Information Engineering, Shanghai University, Shanghai, China
Rui Wang, Wanyu Wu & Xiangyang Wang

Corresponding author

Correspondence to Xiangyang Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: the declaration text was missing.

The original online version of this article was revised: the author portraits were not assigned correctly.

The original online version of this article was revised: there was an error in the biography of Wanyu Wu.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, R., Wu, W. & Wang, X. Enhancing multi-scale information exchange and feature fusion for human pose estimation. Vis Comput 39, 4751–4765 (2023). https://doi.org/10.1007/s00371-022-02623-x

Accepted: 12 July 2022
Published: 18 August 2022
Issue Date: October 2023
DOI: https://doi.org/10.1007/s00371-022-02623-x
