
A real-time and general method for converting offline skeleton-based action recognition to online ones

  • Research
  • Published:
Journal of Real-Time Image Processing

Abstract

Many motion sensors can directly acquire human skeletal data, from which graph convolutional networks (GCNs) extract features for action recognition. However, almost all state-of-the-art (SOTA) methods are offline: they cannot perform online inference and therefore waste computational resources. The existing approach to converting offline action recognition into online action recognition is to reconstruct the offline method's network structure. This requires developers to understand the algorithm's network architecture in depth and to make extensive modifications, which slows development. To address this issue, this paper points out that the key to converting offline methods into online ones is removing outdated frame features and fusing in new frame features. We therefore propose a general and simple model called Encode One Frame (EOF), which performs feature removal and fusion via a correlation matrix under the guidance of a teacher model. The EOF model supports online inference, requiring only the new frame of the current sample and the features encoded from the previous sample. Building on the EOF model, we further propose the You Only Encode One Frame (YOEOF) algorithm to correct the cumulative errors generated during EOF online inference. Combining these components, YOEOF achieves online inference and outperforms some SOTA methods on public datasets. Deployment at the application level shows that our method meets the accuracy and real-time requirements of dangerous-action recognition.
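The abstract describes EOF only at a high level. As a loose, hypothetical sketch (not the authors' actual implementation; the class and variable names below are illustrative assumptions), the core online-update idea — drop the oldest frame feature, append the new one, then fuse frames through a correlation matrix — might look like:

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


class OnlineEncoder:
    """Hypothetical sliding-window encoder: maintains a fixed-size buffer
    of per-frame features, evicts the outdated frame, and fuses the window
    with an attention-style correlation matrix."""

    def __init__(self, window, dim):
        self.window = window
        self.dim = dim
        self.buffer = np.zeros((window, dim))  # per-frame feature buffer

    def step(self, new_frame_feat):
        # Remove the outdated frame feature and append the new one.
        self.buffer = np.vstack([self.buffer[1:], new_frame_feat[None, :]])
        # Correlation matrix over the frames currently in the window.
        corr = softmax(self.buffer @ self.buffer.T / np.sqrt(self.dim))
        # Fuse: each frame becomes a correlation-weighted mix of the window.
        return corr @ self.buffer
```

This keeps the per-step cost proportional to the window size rather than re-encoding the full sequence, which is the efficiency argument the abstract makes for online inference.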


Data availability

The datasets used and analyzed during the current study are publicly available and can be accessed at https://rose1.ntu.edu.sg/dataset/actionRecognition/ and https://github.com/cvdfoundation/kineticsdataset. The relevant data sources are also cited in the manuscript.


Acknowledgements

Project supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No.52302506)

Author information


Corresponding author

Correspondence to Liheng Dong.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Dong, L., He, G., Zhang, Z. et al. A real-time and general method for converting offline skeleton-based action recognition to online ones. J Real-Time Image Proc 22, 49 (2025). https://doi.org/10.1007/s11554-025-01625-x
