MDA-YOLO Person: a 2D human pose estimation model based on YOLO detection framework

Dong, Chengang; Tang, Yuhao; Zhang, Liyan

doi:10.1007/s10586-024-04608-y


Published: 11 June 2024

Cluster Computing, volume 27, pages 12323–12340 (2024)


Abstract

Human pose estimation aims to locate the keypoints of the human body in images or videos. Accurate estimation remains difficult because a model must capture complex spatial relationships and handle bodies at different scales. We propose MDA-YOLO Person, a real-time human pose estimation method based on the anchor-assisted YOLOv7 framework. We introduce Keypoint Augmentation Strategies (KAS) to improve the model’s ability to predict keypoints accurately. We also introduce the Anchor Adjustment Module (AAM) as a replacement for the original YOLOv7 detection head; adjusting the parameters associated with the detector’s anchors increases recall and yields more complete pose estimates. In addition, we incorporate the Multi-Scale Dual-Head Attention (MDA) module, which models the weights of both the channel and spatial dimensions at multiple scales, enabling the model to focus on more salient features. MDA-YOLO Person outperforms the baseline YOLOv7-pose on two large-scale public datasets, with improvements of 2.2% and 3.7% in precision and recall on MS COCO 2017, and 1.9% and 3.5% on CrowdPose, respectively.
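
The abstract does not specify how the MDA module computes its channel and spatial weights, so the following is only a generic illustration of the idea of reweighting a feature map along both dimensions. It is a parameter-free, single-scale sketch in plain Python, loosely in the spirit of CBAM-style attention, and it is not the authors' module: the function name `channel_spatial_gate`, the use of a sigmoid over pooled averages, and the toy input are all assumptions made for illustration. A learned module would replace the pooled averages with trainable layers and, per the abstract, operate at multiple scales.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_spatial_gate(fmap):
    """Reweight a feature map given as nested lists [C][H][W].

    Hypothetical, parameter-free stand-in for learned attention:
    a channel gate followed by a spatial gate, both in (0, 1).
    """
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])

    # Channel gate: squash each channel's global average into (0, 1).
    ch_w = [sigmoid(sum(sum(row) for row in ch) / (H * W)) for ch in fmap]
    scaled = [[[ch_w[c] * fmap[c][i][j] for j in range(W)]
               for i in range(H)] for c in range(C)]

    # Spatial gate: squash the cross-channel mean at each location into (0, 1).
    sp_w = [[sigmoid(sum(scaled[c][i][j] for c in range(C)) / C)
             for j in range(W)] for i in range(H)]

    return [[[sp_w[i][j] * scaled[c][i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


# Toy 2-channel, 2x2 feature map (made-up values).
fmap = [
    [[4.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, -4.0]],
]
out = channel_spatial_gate(fmap)
```

In this toy input the two non-zero activations have equal magnitude, but the one in the channel with the higher average response is attenuated less, which is the qualitative behaviour that channel and spatial weighting is meant to provide.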


Data availability

As our study did not involve the generation or analysis of datasets, the sharing of data is not applicable to this article. We did not gather any specific datasets that would necessitate sharing with other researchers or the general public. Consequently, there are no datasets associated with our investigation that would be accessible for the purpose of data sharing.


Funding

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62172212) and the Natural Science Foundation of Jiangsu Province (Grant No. BK20230031).

Author information

Authors and Affiliations

Nanjing University of Aeronautics and Astronautics, Nanjing, 210000, China
Chengang Dong, Yuhao Tang & Liyan Zhang

Corresponding author

Correspondence to Liyan Zhang.

Ethics declarations

Conflict of interest

We have conducted a thorough assessment of both financial and non-financial affiliations that could potentially create a conflict of interest with the research presented. We unequivocally declare that no conflicts of interest have been identified that could in any way introduce bias or influence the outcomes of our study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dong, C., Tang, Y. & Zhang, L. MDA-YOLO Person: a 2D human pose estimation model based on YOLO detection framework. Cluster Comput 27, 12323–12340 (2024). https://doi.org/10.1007/s10586-024-04608-y

Received: 22 February 2024
Revised: 28 May 2024
Accepted: 29 May 2024
Published: 11 June 2024
Issue Date: December 2024
DOI: https://doi.org/10.1007/s10586-024-04608-y
