Abstract
Yoga action recognition is crucial for precise motion analysis and effective training guidance, which in turn support physical health and skill development. However, current methods struggle to maintain both high accuracy and real-time performance when faced with complex poses and occlusions, and they neglect the dynamic characteristics and temporal-sequence information inherent in yoga actions. This paper therefore proposes a two-stage action recognition method tailored to yoga scenarios. The method first employs knowledge-distillation-based pose estimation to improve the accuracy and efficiency of lightweight models on complex and occluded poses. A lightweight 3D convolutional neural network (3D-CNN) then performs action recognition; keypoint heatmaps bridge the two stages seamlessly, enabling the model to capture spatiotemporal features in video sequences and improving recognition accuracy. Experimental results show that on the COCO dataset, the DistillPose-m model achieves a 2.5% improvement in Average Precision (AP) over RTMPose-m. On the yoga action recognition task, our model exhibits roughly a 2% improvement over traditional Graph Convolutional Network (GCN) methods on both the Deepyoga and 3Dyoga90 datasets. This study improves the performance and accuracy of pose estimation in yoga scenarios, addressing the challenges of bodily occlusion and complex posture, and, by fully exploiting the spatiotemporal information in yoga movements, raises the accuracy of yoga action recognition. It also offers insights and support for motion training and analysis systems in other dynamic activities, such as martial arts and dance.
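To make the two-stage design concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a heatmap distillation loss that lets a lightweight student mimic a stronger teacher, and a small 3D-CNN that classifies a clip of stacked keypoint heatmaps. All names (HeatmapDistillLoss, Heatmap3DCNN), shapes, and hyperparameters are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapDistillLoss(nn.Module):
    # Stage 1 (assumed form): the lightweight student regresses keypoint
    # heatmaps and is supervised both by the ground truth and by a frozen
    # teacher's heatmaps (standard response-based knowledge distillation).
    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # trade-off between GT and teacher terms

    def forward(self, student_hm, teacher_hm, gt_hm):
        loss_gt = F.mse_loss(student_hm, gt_hm)       # fit the ground truth
        loss_kd = F.mse_loss(student_hm, teacher_hm)  # mimic the teacher
        return (1 - self.alpha) * loss_gt + self.alpha * loss_kd

class Heatmap3DCNN(nn.Module):
    # Stage 2 (assumed form): a lightweight 3D-CNN classifying a clip of
    # stacked keypoint heatmaps, i.e. an (N, K, T, H, W) volume where K is
    # the number of joints and T the number of frames.
    def __init__(self, num_joints: int = 17, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(num_joints, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),      # pool space, keep time
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),                  # global spatiotemporal pool
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                             # x: (N, K, T, H, W)
        return self.classifier(self.features(x).flatten(1))

# Usage: heatmaps from the pose stage feed the recognition stage directly.
clip = torch.randn(2, 17, 16, 56, 56)                 # 2 clips of 16 frames
print(Heatmap3DCNN()(clip).shape)                     # torch.Size([2, 10])

Passing heatmaps rather than raw joint coordinates between the two stages preserves each joint's spatial uncertainty, which is what a heatmap-driven 3D-CNN can exploit when aggregating features over time.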
Data availability
No datasets were generated or analysed during the current study.
Acknowledgements
This work was supported by the Natural Science Foundation for Outstanding Young Scholars of Fujian Province (grant number 2022J06023), the Fujian Province Science and Technology Empowering Police Research Initiative (grant number 2024Y0064), and the High-level Talent Innovation and Entrepreneurship Project of Quanzhou City (grant number 2023C013R).
Author information
Contributions
L.T. Zhou was responsible for the conception of the research, data collection, experimental design and implementation, and manuscript writing. W.W. Zhang contributed to the conception of the research and experimental design, and reviewed and edited the manuscript. B.H. Zhang and X.B. Li were responsible for the analysis of experimental data and the preparation of figures. J.Q. Zhu was responsible for the analysis of experimental data and reviewed the manuscript.
Ethics declarations
Conflict of interest
All authors of this research paper declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, L., Zhang, W., Zhang, B. et al. A strong benchmark for yoga action recognition based on lightweight pose estimation model. Multimedia Systems 31, 66 (2025). https://doi.org/10.1007/s00530-024-01646-9