Combining detailed appearance and multi-scale representation: a structure-context complementary network for human pose estimation


Abstract

Human pose estimation is a fundamental and challenging task in computer vision. Hard scenarios, such as occlusion and background clutter, pose a great challenge to high-level feature representation because both fine detail and multi-scale context must be reasoned about correctly. In this paper, we propose a structure-context complementary network (SCC-Net), characterized by the complementarity between a pixel-wise enhanced attention mechanism and an atrous-convolution-based module. The proposed cross-coordinate attention bottleneck (CCAB) uses a cross-guide mechanism to improve the robustness of the existing coordinate attention module (CAM) against background interference. As a complement to CCAB, waterfall residual atrous pooling (WRAP) refines structural consistency by generating multi-scale features without the feature-sparsity defect of atrous-based methods. We evaluate the proposed modules and the holistic SCC-Net on the COCO and MPII benchmark datasets. Ablation experiments demonstrate that the proposed modules efficiently boost the performance of body-joint detection, and the holistic SCC-Net achieves competitive performance compared with other state-of-the-art methods.
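The abstract builds on two ingredients that are documented in prior work: coordinate attention (Hou et al., CVPR 2021) and waterfall-style atrous pooling (Artacho and Savakis, CVPR 2020). The minimal PyTorch sketch below illustrates only those two ingredients; the reduction ratio, the dilation rates, and the cross-guide mechanism that distinguishes CCAB are not specified in the abstract, so the values and wiring shown here are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Coordinate attention in the spirit of Hou et al. (2021):
    the feature map is pooled along H and W separately, so each
    attention gate keeps position information along one axis."""

    def __init__(self, channels, reduction=32):  # reduction ratio is an assumption
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.size()
        # Direction-aware pooling: (N, C, H, 1) and (N, C, 1, W) -> stacked along dim 2.
        x_h = x.mean(dim=3, keepdim=True)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)
        y = torch.cat([x_h, x_w], dim=2)
        y = self.act(self.bn1(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                         # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))     # (N, C, 1, W)
        return x * a_h * a_w


class WaterfallAtrousPooling(nn.Module):
    """Waterfall-style multi-scale context: atrous branches are chained so
    each branch refines the previous one's output, and a residual connection
    preserves dense detail. Dilation rates are illustrative assumptions."""

    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Fuse the concatenated branch outputs back to `channels`.
        self.fuse = nn.Conv2d(channels * len(rates), channels, kernel_size=1)

    def forward(self, x):
        outs, y = [], x
        for branch in self.branches:
            y = branch(y)        # waterfall: feed the previous branch's output forward
            outs.append(y)
        return x + self.fuse(torch.cat(outs, dim=1))   # residual multi-scale fusion


if __name__ == "__main__":
    feat = torch.randn(2, 64, 48, 36)
    out = WaterfallAtrousPooling(64)(CoordinateAttention(64)(feat))
    print(out.shape)  # torch.Size([2, 64, 48, 36])

In SCC-Net the two ideas are intended to be complementary: the attention path suppresses background response around each joint, while the waterfall path supplies multi-scale context that keeps limb structure consistent.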



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61902404, 62001475, and 62071472; in part by the Program for "Industrial IoT and Emergency Collaboration" Innovative Research Team at China University of Mining and Technology (CUMT) under Grant 2020ZY002; in part by the Special Fund for Cultivating Major Projects at China University of Mining and Technology (CUMT) under Grant 2020-10991; in part by the Graduate Research and Innovation Projects of Jiangsu Province under Grant KYCX21_2241; in part by the China Scholarship Council under Grant 202106420091; and in part by the Yulin Smart Energy Big Data Application Joint Key Laboratory under Grant 202100208-01.

Author information

Corresponding author

Correspondence to Yanjing Sun.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Dong, K., Sun, Y., Cheng, X. et al. Combining detailed appearance and multi-scale representation: a structure-context complementary network for human pose estimation. Appl Intell 53, 8097–8113 (2023). https://doi.org/10.1007/s10489-022-03909-2

