Abstract
Many motion sensors can directly acquire human skeletal data, from which graph convolutional networks (GCNs) extract features for action recognition. However, almost all state-of-the-art (SOTA) methods are offline: they cannot perform online inference and therefore waste computational resources. The existing way to convert offline action recognition into online action recognition is to rebuild the offline method's network structure, which requires developers to understand that structure in depth and make extensive modifications, slowing development. To address this issue, this paper observes that the key to converting an offline method into an online one is removing outdated frame features and fusing in new frame features. We therefore propose a simple and general model called Encode One Frame (EOF), which performs feature removal and fusion through a correlation matrix under the guidance of a teacher model. EOF supports online inference, requiring only the new frame of the current sample and the features already encoded from the old sample. Building on EOF, we further propose the You Only Encode One Frame (YOEOF) algorithm, which corrects the cumulative errors that arise during EOF's online inference. Coupling these components, YOEOF achieves online inference and outperforms several SOTA methods on public datasets. An application-level deployment shows that our method meets the high-accuracy and real-time requirements of dangerous-action recognition.
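The online-update idea the abstract describes (cache per-frame features for the current window, drop the oldest frame's feature, and fuse in the newly encoded frame through a correlation matrix) can be sketched in a few lines of Python. The following is a minimal illustration under stated assumptions: the class name OnlineEOFSketch, the buffer layout, the tanh frame encoder, and the identity correlation matrix are hypothetical placeholders rather than the authors' EOF implementation, and the teacher-model guidance, a training-time component, is omitted.

```python
import numpy as np

class OnlineEOFSketch:
    """Hypothetical sketch of an EOF-style online update.

    Encodes only the newest skeleton frame, removes the outdated
    frame's feature from a fixed-length buffer, and fuses the new
    feature in via a correlation matrix. Names and shapes are
    illustrative assumptions, not the paper's implementation.
    """

    def __init__(self, window: int, feat_dim: int):
        self.window = window                        # temporal window length
        self.buffer = np.zeros((window, feat_dim))  # cached per-frame features
        self.corr = np.eye(feat_dim)                # assumed learned correlation matrix

    def encode_frame(self, frame: np.ndarray) -> np.ndarray:
        # Stand-in per-frame encoder; the paper extracts GCN features here.
        return np.tanh(frame)

    def step(self, frame: np.ndarray) -> np.ndarray:
        """Consume one new frame; return the fused sample-level feature."""
        new_feat = self.encode_frame(frame)
        # Remove the oldest (outdated) frame feature, append the new one.
        self.buffer = np.vstack([self.buffer[1:], new_feat[None, :]])
        # Fuse all cached features through the correlation matrix.
        return (self.buffer @ self.corr).mean(axis=0)

# Usage: stream frames one at a time instead of re-encoding the whole clip.
model = OnlineEOFSketch(window=64, feat_dim=128)
for _ in range(300):
    frame = np.random.randn(128)        # stand-in for one skeleton frame
    sample_feature = model.step(frame)  # online inference, fixed cost per frame
```

Note the design point this sketch makes concrete: each new frame costs one encoder pass plus a fixed-size fusion, which is what gives an EOF-style model its constant per-frame inference cost instead of re-encoding the entire clip.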
Data availability
The datasets used and analyzed during the current study are publicly available and can be accessed at https://rose1.ntu.edu.sg/dataset/actionRecognition/ and https://github.com/cvdfoundation/kinetics-dataset. The relevant data sources are also cited in the manuscript.
Acknowledgements
Project supported by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 52302506).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dong, L., He, G., Zhang, Z. et al. A real-time and general method for converting offline skeleton-based action recognition to online ones. J Real-Time Image Proc 22, 49 (2025). https://doi.org/10.1007/s11554-025-01625-x