
CFENet: Content-aware feature enhancement network for multi-person pose estimation

Published in: Applied Intelligence

Abstract

Multi-person pose estimation is a fundamental yet challenging task in computer vision. Although deep learning has driven great progress in this field, complex situations (e.g., extreme poses, occlusions, overlapping persons, and crowded scenes) remain poorly handled. To further mitigate these issues, we propose a novel Content-aware Feature Enhancement Network (CFENet), which consists of three effective modules: a Feature Aggregation and Selection Module (FASM), a Feature Fusion Module (FFM), and a Dense Upsampling Convolution (DUC) module. The FASM comprises a Feature Aggregation Module (FAM) and an Information Selection Module (ISM). The FAM constructs hierarchical multi-scale feature aggregations at a granular level to capture more accurate fine-grained representations. The ISM makes the aggregated representations more discriminative by adaptively highlighting informative human-part representations in both spatial location and channel context. The FFM then fuses high-resolution spatial features with low-resolution semantic features to obtain more reliable context information for joint estimation. Finally, the DUC module generates more precise predictions by recovering joint details that are usually lost in common upsampling processes. Comprehensive experiments demonstrate that the proposed approach outperforms most popular methods and achieves performance competitive with the state of the art on three benchmark datasets: the recent large-scale CrowdPose dataset, the COCO keypoint detection dataset, and the MPII Human Pose dataset. Our code will be released upon acceptance.
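To make the DUC idea concrete, below is a minimal numpy sketch of the channel-to-space rearrangement that dense upsampling convolution is generally built on: a preceding (learned) convolution produces C·r² channels at low resolution, and those channels are reshaped into a C-channel map at r× resolution, so no interpolation is involved. This is an illustrative sketch, not the paper's implementation; the function name `duc_rearrange` and the omission of the learned convolution are our assumptions.

```python
import numpy as np

def duc_rearrange(x, r):
    """Channel-to-space rearrangement at the core of Dense Upsampling
    Convolution (DUC). Illustrative sketch, not the paper's code: the
    learned convolution that produces the C*r*r channels is omitted.

    x: array of shape (C*r*r, H, W); returns (C, H*r, W*r).
    """
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_r2 // (r * r)
    # Split channels into (C, r, r), then interleave the two r-axes with
    # the spatial axes: (C, r, r, H, W) -> (C, H, r, W, r) -> (C, H*r, W*r).
    x = x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2)
    return x.reshape(c, h * r, w * r)

# Each group of r*r channels fills one r-by-r block of the output,
# so fine detail is carried in channels rather than lost to interpolation.
lowres = np.arange(16).reshape(4, 2, 2)   # C=1, r=2, H=W=2
highres = duc_rearrange(lowres, 2)
print(highres.shape)                      # (1, 4, 4)
```

In deep-learning frameworks the same rearrangement is commonly available as a pixel-shuffle operation (e.g. `torch.nn.PixelShuffle` in PyTorch).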


Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grant No. 61473031 and the Fundamental Research Funds for the Central Universities (2019JBM019).

Corresponding author

Correspondence to Qi Zou.



Cite this article

Xu, X., Zou, Q. & Lin, X. CFENet: Content-aware feature enhancement network for multi-person pose estimation. Appl Intell 52, 215–236 (2022). https://doi.org/10.1007/s10489-021-02383-6
