A strong benchmark for yoga action recognition based on lightweight pose estimation model

  • Regular Paper
Multimedia Systems

Abstract

Yoga action recognition enables precise motion analysis and effective training guidance, which in turn support physical health and skill development. However, current methods struggle to maintain both high accuracy and real-time performance when dealing with complex poses and occlusions. They also neglect the dynamic characteristics and temporal sequence information inherent in yoga actions. This paper therefore proposes a two-stage action recognition method tailored to yoga scenarios. The first stage applies knowledge distillation to pose estimation, improving the accuracy and efficiency of lightweight models on complex poses and occlusions. The second stage uses a lightweight 3D convolutional neural network (3D-CNN) for action recognition. Heat maps connect the two stages, which improves recognition accuracy and captures spatiotemporal features in video sequences. Experimental results show that on the COCO dataset, the DistillPose-m model achieves a 2.5% improvement in Average Precision (AP) over RTMPose-m. In the yoga action recognition task, our model exhibits an improvement of approximately 2% over traditional Graph Convolutional Network (GCN) methods on both the Deepyoga and 3Dyoga90 datasets. By improving pose estimation under bodily occlusion and complex postures, and by exploiting the spatiotemporal information inherent in yoga movements, the method improves the accuracy of yoga action recognition. The research also offers insights and support for motion training and analysis systems in other dynamic activities, such as martial arts and dance.
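The abstract states that heat maps link the pose estimation stage to the 3D-CNN. The following is a minimal sketch of one way such a link could look, not the authors' implementation: it assumes each joint is rendered as a 2D Gaussian scaled by its confidence, and that the per-frame maps are stacked over time into a volume a 3D-CNN could consume. The function names `joint_heatmap` and `heatmap_volume`, the `sigma` parameter, and the toy coordinates are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Assumed keypoint format: (x, y, confidence) in heat-map pixel coordinates.
Keypoint = Tuple[float, float, float]


def joint_heatmap(x: float, y: float, score: float,
                  height: int, width: int, sigma: float = 1.0) -> List[List[float]]:
    """Render one joint as a 2D Gaussian centred at (x, y), scaled by its confidence."""
    return [
        [score * math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]


def heatmap_volume(frames: Sequence[Sequence[Keypoint]],
                   height: int, width: int,
                   sigma: float = 1.0) -> List[List[List[List[float]]]]:
    """Stack per-joint heat maps over time, indexed as [joint][frame][row][col]."""
    num_joints = len(frames[0])
    if any(len(frame) != num_joints for frame in frames):
        raise ValueError("every frame must contain the same number of joints")
    return [
        [joint_heatmap(*frame[j], height, width, sigma) for frame in frames]
        for j in range(num_joints)
    ]


# Toy input: 2 frames, 2 joints each (hypothetical values, not from any dataset).
frames = [
    [(1.0, 1.0, 0.9), (3.0, 2.0, 0.5)],
    [(2.0, 1.0, 0.8), (3.0, 3.0, 0.0)],  # second joint is undetected (confidence 0)
]
volume = heatmap_volume(frames, height=4, width=5)

# The peak of joint 0 in frame 0 sits at row 1, col 1 and equals its confidence.
peak = volume[0][0][1][1]
```

In this layout the joint axis plays the role of the channel dimension and the frame axis the temporal dimension, so a low-confidence or occluded joint contributes a weak or empty map instead of a misleading coordinate.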

Data availability

No datasets were generated or analysed during the current study.

Acknowledgements

This work was supported by the Natural Science Foundation for Outstanding Young Scholars of Fujian Province (grant number 2022J06023), the Fujian Province Science and Technology Empowering Police Research Initiative (grant number 2024Y0064), and the High-level Talent Innovation and Entrepreneurship Project of Quanzhou City (grant number 2023C013R).

Author information

Contributions

L.T. Zhou was responsible for the conception of the research, data collection, experimental design and implementation, and manuscript writing. W.W. Zhang contributed to the conception of the research and experimental design, and reviewed and edited the manuscript. B.H. Zhang and X.B. Li were responsible for the analysis of experimental data and the preparation of figures. J.Q. Zhu was responsible for the analysis of experimental data and reviewed the manuscript.

Corresponding author

Correspondence to Weiwei Zhang.

Ethics declarations

Conflict of interest

All authors of this research paper declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhou, L., Zhang, W., Zhang, B. et al. A strong benchmark for yoga action recognition based on lightweight pose estimation model. Multimedia Systems 31, 66 (2025). https://doi.org/10.1007/s00530-024-01646-9
