
MSPENet: multi-scale adaptive fusion and position enhancement network for human pose estimation

Original article · The Visual Computer

Abstract

Human pose estimation is a fundamental yet challenging task in computer vision. With the advent of deep neural networks, human pose estimation has made great progress. However, existing pose estimation networks still struggle to detect small-scale keypoints and to distinguish semantically confusable keypoints. In this paper, a novel convolutional neural network, the multi-scale adaptive fusion and position enhancement network (MSPENet), is proposed to address these two problems. First, a multi-scale adaptive fusion unit is proposed to dynamically select and fuse features at different scales, so that small-scale keypoints obtain the detailed information needed for their detection. Second, we observe that although visually similar body parts are difficult to distinguish semantically, they differ significantly in spatial location. A position enhancement module is therefore designed to highlight features at true joint locations while learning more discriminative features that suppress responses from similar-looking joint regions. Finally, a global context block is applied to refine the predictions and further improve the network's performance. Experiments on both single- and multi-person pose estimation benchmarks show that our approach yields more accurate and reliable results.
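The paper's code is not reproduced on this page, so the PyTorch sketch below only illustrates two of the ideas the abstract names: softmax-weighted fusion across feature scales (in the spirit of selective-kernel attention) and a GCNet-style global context block (Cao et al., 2019). All module and parameter names here are hypothetical, and the exact MSPENet layers may differ; the position enhancement module is omitted because the abstract gives no structural detail for it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveScaleFusion(nn.Module):
    """Hypothetical multi-scale adaptive fusion unit: resize features
    from several resolutions to a common size, then fuse them with
    learned, input-dependent per-scale weights (softmax over scales)."""

    def __init__(self, channels: int, num_scales: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        # Squeeze the summed multi-scale features into a global descriptor.
        self.squeeze = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # One 1x1 head per scale emits that scale's per-channel logit.
        self.heads = nn.ModuleList(
            [nn.Conv2d(hidden, channels, kernel_size=1) for _ in range(num_scales)]
        )

    def forward(self, feats):
        # feats: list of [B, C, Hi, Wi]; upsample everything to the first
        # (highest) resolution so fine detail for small keypoints is kept.
        target = feats[0].shape[-2:]
        feats = [
            f if f.shape[-2:] == target
            else F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feats
        ]
        stacked = torch.stack(feats, dim=1)                        # [B, S, C, H, W]
        ctx = self.squeeze(stacked.sum(dim=1))                     # [B, hidden, 1, 1]
        logits = torch.stack([h(ctx) for h in self.heads], dim=1)  # [B, S, C, 1, 1]
        weights = torch.softmax(logits, dim=1)                     # softmax over scales
        return (weights * stacked).sum(dim=1)                      # [B, C, H, W]
```

The global context block's structure (single-channel softmax attention pooling, a bottleneck transform with LayerNorm, and broadcast addition) follows the published GCNet design; how it is wired into MSPENet is again an assumption.

```python
class GlobalContextBlock(nn.Module):
    """GCNet-style global context block: pool a global context vector
    with softmax spatial attention, transform it through a bottleneck,
    and add it back to every spatial position."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 8)
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # Context modeling: one softmax attention map over all positions.
        attn = self.attn(x).view(b, 1, h * w).softmax(dim=-1)         # [B, 1, HW]
        ctx = torch.bmm(x.view(b, c, h * w), attn.transpose(1, 2))    # [B, C, 1]
        ctx = ctx.view(b, c, 1, 1)
        return x + self.transform(ctx)  # broadcast fusion by addition
```

In this sketch, `AdaptiveScaleFusion(channels=64, num_scales=3)` would fuse a pyramid such as `[B, 64, 64, 48]`, `[B, 64, 32, 24]`, `[B, 64, 16, 12]` into a single `[B, 64, 64, 48]` map, which the global context block could then refine before the final heatmap prediction.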




Acknowledgements

This research was partially supported by the Beijing Natural Science Foundation (No. 4212025) and the National Natural Science Foundation of China (Nos. 61876018, 61906014, and 61976017).

Author information


Corresponding author

Correspondence to Weibin Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was not required, as this study did not involve human participants.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Xu, J., Liu, W., Xing, W. et al. MSPENet: multi-scale adaptive fusion and position enhancement network for human pose estimation. Vis Comput 39, 2005–2019 (2023). https://doi.org/10.1007/s00371-022-02460-y

