Skip to main content

Advertisement

Log in

SP-YOLO: an end-to-end lightweight network for real-time human pose estimation

  • Original Paper
  • Published:
Signal, Image and Video Processing Aims and scope Submit manuscript

Abstract

The traditional multi-person human pose estimation method has several problems including low real-time detection effect, low recognition efficiency, and a large number of calculation parameters. In this paper, we propose a lightweight network SP-YOLO based on the YOLO-Pose algorithm for real-time human pose estimation. Firstly, a lightweight ghost spatial pyramid pooling-fast (GSPPF) module is proposed by using GhostNet instead of convolution in the CSPDarkNet backbone network to make the pooling structure do layer-by-layer maxpooling operation and improve the speed of the backbone network. The ultra-lightweight attention mechanism called the shuffle attention (SA) module, is added between layers. Secondly, the neck structure is enhanced, and a more efficient slim path aggregation network (Slim-PAN) module is proposed to optimize the input relationship in the pyramid-enhanced feature structure. The loss function is refined to assign different weights to human pose detection and keypoint detection. Finally, the SP-YOLO method is evaluated through dataset evaluation and real-time detection experiments. A comparison with several other methods on the COCO2017-Keypoints dataset demonstrates that SP-YOLO improves the average precision (AP) by 1.5% compared to the original YOLO-Pose while reducing the number of parameters by 7.9 M. In the real-time detection experiment, the proposed method proves to be effective for end-to-end real-time detection of human poses.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

Data availability

The data come from the common dataset.

References

  1. Li, Y., Jia, S., Li, Q.: BalanceHRNet: an effective network for bottom-up human pose estimation. Neural Netw. 161, 297–305 (2023). https://doi.org/10.1016/j.neunet.2023.01.036

    Article  Google Scholar 

  2. Miki, D., Abe, S., Chen, S., et al.: Robust human pose estimation from distorted wide-angle images through iterative search of transformation parameters. SIViP 14, 693–700 (2020). https://doi.org/10.1007/s11760-019-01602-5

    Article  Google Scholar 

  3. Lu, H., Shao, X., Xiao, Y.: Pose estimation with segmentation consistency. IEEE Trans. Image Process. 22(10), 4040–4048 (2013). https://doi.org/10.1109/TIP.2013.2268975

    Article  MathSciNet  Google Scholar 

  4. Wan, T., Luo, Y., Zhang, Z., et al.: TSNet: tree structure network for human pose estimation. SIViP 16, 551–558 (2022). https://doi.org/10.1007/s11760-021-01999-y

    Article  Google Scholar 

  5. Ke, X., Liu, T., Li, Z.: Human attribute recognition method based on pose estimation and multiple-feature fusion. SIViP 14, 1441–1449 (2020). https://doi.org/10.1007/s11760-020-01690-8

    Article  Google Scholar 

  6. Dayarathna, T., Muthukumarana, T., Rathnayaka, Y., Denman, S., de Silva, C., Pemasiri, A., Ahmedt-Aristizabal, D.: Privacy-preserving in-bed pose monitoring: a fusion and reconstruction study. Expert Syst. Appl. 213, 119139 (2023). https://doi.org/10.1016/j.eswa.2022.119139

    Article  Google Scholar 

  7. Gomes, M.E.N., Macêdo, D., Zanchettin, C., De-Mattos-Neto, P.S.G., Oliveira, A.: Multi-human fall detection and localization in videos. Comput. Vision Image Understand. 220, 103442 (2022). https://doi.org/10.1016/j.cviu.2022.103442

    Article  Google Scholar 

  8. Cao, D., Liu, W., Xing, W., et al.: Human pose estimation based on feature enhancement and multi-scale feature fusion. SIViP 17, 643–650 (2023). https://doi.org/10.1007/s11760-022-02271-7

    Article  Google Scholar 

  9. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In European conference on computer vision (ECCV), (2018). https://doi.org/10.48550/arXiv.1804.06208

  10. Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: The IEEE conference on computer vision and pattern recognition (CVPR), (2018). https://doi.org/10.48550/arXiv.1711.07319

  11. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: The IEEE conference on computer vision and pattern recognition (CVPR), (2019). https://doi.org/10.48550/arXiv.1902.09212

  12. Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: The IEEE conference on computer vision and pattern recognition (CVPR), (2020). https://doi.org/10.48550/arXiv.1908.10357

  13. Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P.V., Schiele, B:. DeepCut: joint subset partition and labeling for multi person pose estimation. In: The IEEE conference on computer vision and pattern recognition (CVPR), (2016), pp. 4929–4937. https://doi.org/10.1109/CVPR.2016.533

  14. Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., Schieke, B.: DeeperCut: a deeper, stronger, and faster multi-person pose estimation model. In: The European conference on computer vision (ECCV), (2016). https://doi.org/10.1007/978-3-319-46466-4_3

  15. Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43(1), 172–186 (2021). https://doi.org/10.1109/TPAMI.2019.2929257

    Article  Google Scholar 

  16. Yang, S., Feng, Z., Wang, Z., Li, Y., Zhang, S., Quan, Z., Xia, S., Yang, W.: Detecting and grouping keypoints for multi-person pose estimation using instance-aware attention. Pattern Recogn. 136, 109232 (2023). https://doi.org/10.1016/j.patcog.2022.109232

    Article  Google Scholar 

  17. Xu, Y., Piao, Z., Zhang, Z., Liu, W., Gao, S.: SUNNet: a novel framework for simultaneous human parsing and pose estimation. Neurocomputing 444, 349–355 (2021). https://doi.org/10.1016/j.neucom.2020.01.123

    Article  Google Scholar 

  18. Lamas, A., Tabik, S., Montes, A.C., Pérez-Hernández, F., García, J., Olmos, R., Herrera, F.: Human pose estimation for mitigating false negatives in weapon detection in video-surveillance. Neurocomputing 489, 488–503 (2022). https://doi.org/10.1016/j.neucom.2021.12.059

    Article  Google Scholar 

  19. Hou, Q., Zhou, D., Feng, J.: Coordinate attention for efficient mobile network design. In: 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR), Nashville, TN, USA, pp. 13708–13717, (2021). https://doi.org/10.1109/CVPR46437.2021.01350

  20. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: IEEE conference on computer vision and pattern recognition (CVPR), (2018). https://doi.org/10.48550/arXiv.1709.01507

  21. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: convolutional block attention module. In: The European conference on computer vision (ECCV), (2018). https://doi.org/10.1007/978-3-030-01234-2_1

  22. Zhang, Q., Yang, Y.: SA-Net: shuffle attention for deep convolutional neural networks. arXiv preprint, (2021). https://doi.org/10.48550/arXiv.2102.00240

  23. Jiang, C., Huang, K., Zhang, S., Wang, X., Xiao, J., Goulermas, Y.: Aggregated pyramid gating network for human pose estimation without pre-training. Pattern Recogn. 138, 109429 (2023). https://doi.org/10.1016/j.patcog.2023.109429

    Article  Google Scholar 

  24. Zhong, F., Li, M., Zhang, K., Hu, J., Liu, L.: DSPNet: a low computational-cost network for human pose estimation. Neurocomputing 423, 327–335 (2021). https://doi.org/10.1016/j.neucom.2020.11.003

    Article  Google Scholar 

  25. Maji, D., Nagori, S., Mathew, M., Poddar, D.: YOLO-Pose: enhancing YOLO for multi-person pose estimation using object keypoint similarity loss. arXiv preprint, (2022). https://doi.org/10.48550/arXiv.2204.06806

  26. Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., Xu, C.: GhostNet: more features from cheap operations. In: IEEE conference on computer vision and pattern recognition (CVPR), (2020). https://doi.org/10.48550/arXiv.1911.11907

  27. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: The European conference on computer vision (ECCV), vol. 8693, pp. 740-755, (2014)

  28. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: The European conference on computer vision (ECCV), (2016). https://doi.org/10.48550/arXiv.1603.06937

  29. Geng, Z., Sun, K., Xiao, B., Zhang, Z., Wang, J.: Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression. In: The IEEE conference on computer vision and pattern recognition (CVPR), (2021). https://doi.org/10.48550/arXiv.2104.02300

  30. Osokin, D.: Real-time 2D multi-person pose estimation on CPU: lightweight OpenPose. arXiv preprint, 2018. https://doi.org/10.48550/arXiv.1811.12004

Download references

Funding

The authors wish to express their gratitude to the National Ministry of Science and Technology Innovation Method Special(2020IM020700).

Author information

Authors and Affiliations

Authors

Contributions

The first draft of the manuscript was written by YZ and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zongyan Wang.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, Y., Wang, Z., Li, M. et al. SP-YOLO: an end-to-end lightweight network for real-time human pose estimation. SIViP 18, 863–876 (2024). https://doi.org/10.1007/s11760-023-02812-8

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11760-023-02812-8

Keywords

Navigation