Abstract
Many motion sensors can directly acquire human skeletal data, from which graph convolutional networks (GCNs) extract features for action recognition. However, almost all state-of-the-art (SOTA) methods are offline: they cannot perform online inference and therefore waste computational resources. The existing way to convert offline action recognition into online action recognition is to rebuild the offline method's network structure, which requires developers to understand that structure in depth and make extensive modifications, slowing development. To address this issue, this paper observes that the key to converting an offline method into an online one is removing outdated frame features and fusing in new frame features. We therefore propose a simple and general model called Encode One Frame (EOF), which performs feature removal and fusion through a correlation matrix under the guidance of a teacher model. EOF supports online inference, requiring only the new frame of the current sample and the features already encoded from the old sample. Building on EOF, we further propose the You Only Encode One Frame (YOEOF) algorithm, which corrects the cumulative errors that arise during EOF's online inference. Coupling these components, YOEOF achieves online inference and outperforms several SOTA methods on public datasets. An application-level deployment shows that our method meets the high-accuracy and real-time requirements of dangerous-action recognition.
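The online-update idea the abstract describes (cache per-frame features for the current window, drop the oldest frame's feature, and fuse in the newly encoded frame through a correlation matrix) can be sketched in a few lines of Python. The following is a minimal illustration under stated assumptions: the class name OnlineEOFSketch, the buffer layout, the tanh frame encoder, and the identity correlation matrix are hypothetical placeholders rather than the authors' EOF implementation, and the teacher-model guidance, a training-time component, is omitted.

```python
import numpy as np

class OnlineEOFSketch:
    """Hypothetical sketch of an EOF-style online update.

    Encodes only the newest skeleton frame, removes the outdated
    frame's feature from a fixed-length buffer, and fuses the new
    feature in via a correlation matrix. Names and shapes are
    illustrative assumptions, not the paper's implementation.
    """

    def __init__(self, window: int, feat_dim: int):
        self.window = window                        # temporal window length
        self.buffer = np.zeros((window, feat_dim))  # cached per-frame features
        self.corr = np.eye(feat_dim)                # assumed learned correlation matrix

    def encode_frame(self, frame: np.ndarray) -> np.ndarray:
        # Stand-in per-frame encoder; the paper extracts GCN features here.
        return np.tanh(frame)

    def step(self, frame: np.ndarray) -> np.ndarray:
        """Consume one new frame; return the fused sample-level feature."""
        new_feat = self.encode_frame(frame)
        # Remove the oldest (outdated) frame feature, append the new one.
        self.buffer = np.vstack([self.buffer[1:], new_feat[None, :]])
        # Fuse all cached features through the correlation matrix.
        return (self.buffer @ self.corr).mean(axis=0)

# Usage: stream frames one at a time instead of re-encoding the whole clip.
model = OnlineEOFSketch(window=64, feat_dim=128)
for _ in range(300):
    frame = np.random.randn(128)        # stand-in for one skeleton frame
    sample_feature = model.step(frame)  # online inference, fixed cost per frame
```

Note the design point this sketch makes concrete: each new frame costs one encoder pass plus a fixed-size fusion, which is what gives an EOF-style model its constant per-frame inference cost instead of re-encoding the entire clip.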
Data availability
The datasets used and analyzed during the current study are publicly available and can be accessed at https://rose1.ntu.edu.sg/dataset/actionRecognition/ and https://github.com/cvdfoundation/kinetics-dataset. The relevant data sources are also cited in the manuscript.
Acknowledgements
Project supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 52302506).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dong, L., He, G., Zhang, Z. et al. A real-time and general method for converting offline skeleton-based action recognition to online ones. J Real-Time Image Proc 22, 49 (2025). https://doi.org/10.1007/s11554-025-01625-x