Abstract
The human skeleton carries intuitive motion information and has therefore been widely studied in action analysis tasks. Within action analysis, human action assessment has traditionally been modeled with handcrafted-feature-based methods such as dynamic time warping (DTW). These methods capture only the similarity of particular spatio-temporal features and tend to ignore the global spatio-temporal relevance of the action. In this paper, we propose a regression assessment model for spatio-temporal action features, which encodes temporal features, spatial features, and fused features separately. A self-attention mechanism fuses the decoupled features, and the overall action score is then computed by regression. Specifically, via structure-feature fusion adaptive graph convolutional networks (SFAGCN), our action assessment network models the deep dependencies among global spatio-temporal features to address the difficulties of limited expressive ability and generalization. Furthermore, the topology of the skeletal graph and the features of the joints are merged by decoupling the spatio-temporal correlations. To confirm the effectiveness of our assessment model, we conduct experiments on six Olympic Games assessment tasks and exceed state-of-the-art performance as measured by Spearman’s rank correlation.
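As an illustration of the evaluation metric named above, the following is a minimal sketch of Spearman's rank correlation between predicted and ground-truth scores. It is not the authors' evaluation code: the function names `average_ranks` and `spearman_rho` and the sample scores are hypothetical, and ties are assumed to receive their average rank.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(predicted: Sequence[float], ground_truth: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(predicted) != len(ground_truth) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rp = average_ranks(predicted)
    rg = average_ranks(ground_truth)
    n = len(rp)
    mean_p = sum(rp) / n
    mean_g = sum(rg) / n
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(rp, rg))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_g = sum((b - mean_g) ** 2 for b in rg)
    if var_p == 0 or var_g == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_p * var_g) ** 0.5


if __name__ == "__main__":
    # Hypothetical scores for illustration only, not results from the paper.
    model_scores = [78.2, 65.0, 91.4, 70.3, 84.9]
    judge_scores = [80.0, 60.5, 95.0, 72.0, 79.5]
    print(f"Spearman rho: {spearman_rho(model_scores, judge_scores):.3f}")
```

Because the metric depends only on ranks, it rewards a model for ordering performances the same way the judges do, regardless of the absolute scale of the regressed scores.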
Acknowledgments
This work is partially supported by the National Natural Science Foundation of China (No. 61972267) and the Natural Science Foundation of Hebei Province (No. F2019210306).
About this article
Cite this article
Zhang, Z., Wang, Z., Zhuang, S. et al. Toward action recognition and assessment using SFAGCN and combinative regression model of spatiotemporal features. Appl Intell 53, 757–768 (2023). https://doi.org/10.1007/s10489-022-03411-9
DOI: https://doi.org/10.1007/s10489-022-03411-9