Abstract
Skeleton-based action recognition is a challenging research problem: the temporal information in a human skeleton sequence is more difficult to extract than the spatial information, and many researchers have therefore turned to graph convolutional networks. In this study, a two-stream action recognition method called RNXt-GCN is proposed on the basis of the Spatial-Temporal Graph Convolutional Network (ST-GCN). The human skeleton sequence is first converted into a spatial-temporal graph and a SkeleMotion image, which are fed into ST-GCN and ResNeXt, respectively, for spatial-temporal convolution; the convolved features are then fused. The proposed method models the temporal information of an action through both its amplitude and direction, addressing the isolated treatment of temporal information in ST-GCN. Experiments are performed comprehensively on four datasets: 1) UTD-MHAD, 2) Northwestern-UCLA, 3) NTU RGB+D 60, and 4) NTU RGB+D 120. The proposed model shows very competitive results compared with other models in our experiments, and on the NTU RGB+D 120 dataset it outperforms state-of-the-art two-stream models.
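The fusion of the two streams described above can be sketched as a simple score-level combination. The sketch below is illustrative only: the feature extractors, the fusion weight `alpha`, and the function names are assumptions for exposition, not the authors' exact implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_two_stream(gcn_logits, cnn_logits, alpha=0.5):
    """Late (score-level) fusion of the two streams.

    gcn_logits: class scores from the ST-GCN stream (spatial-temporal graph).
    cnn_logits: class scores from the ResNeXt stream (SkeleMotion image).
    alpha: hypothetical fusion weight for the graph stream; the paper's
    actual fusion strategy may differ.
    """
    fused = [alpha * g + (1.0 - alpha) * c
             for g, c in zip(gcn_logits, cnn_logits)]
    return softmax(fused)

# Toy example with four action classes.
probs = fuse_two_stream([2.0, 0.5, -1.0, 0.1], [1.5, 0.8, -0.5, 0.0])
predicted = max(range(len(probs)), key=lambda i: probs[i])
```

A weighted sum of per-class scores followed by softmax is a common late-fusion choice for two-stream models; feature-level fusion (concatenating the convolved features before the classifier) is the other standard option.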
References
Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI, pp 1–8
Xie S, Girshick R, Dollár P et al (2017) Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1492–1500
Chen C (2015) UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In: IEEE International Conference on Image Processing (ICIP)
Wang J, Nie X, Xia Y et al (2014) Cross-view action modeling, learning and recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2649–2656
Shahroudy A, Liu J, Ng TT et al (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1010–1019
Liu J, Shahroudy A, Perez M et al (2019) NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Trans Pattern Anal Mach Intell
Li J, Wong Y, Zhao Q et al (2018) Unsupervised learning of view-invariant action representations. In: Advances in Neural Information Processing Systems (NIPS), pp 1254–1264
Fiorini L, Mancioppi G, Semeraro F, Fujita H, Cavallo F (2020) Unsupervised emotional state classification through physiological parameters for social robotics applications. Knowl-Based Syst 190:105217
Han F, Reily B, Hoff W, Zhang H (2017) Space-time representation of people based on 3D skeletal data: a review. Comput Vis Image Underst 158:85–105
Wang P, Li W, Ogunbona P, Wan J, Escalera S (2018) RGB-D-based human motion recognition with deep learning: a survey. Comput Vis Image Underst 171:118–139
Liu M, Liu H, Chen C (2017) Enhanced skeleton visualization for view invariant human action recognition. Pattern Recogn
Zhang P, Lan C, Xing J, Zeng W, Xue J, Zheng N (2019) View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans Pattern Anal Mach Intell 41:1963–1978
Hou Y, Li Z, Wang P, Li W (2018) Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans Circuits Syst Video Technol 28(3):807–811
Li S, Li W, Cook C et al (2018) Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 5457–5466
Hu G, Cui B, Yu S (2019) Skeleton-based action recognition with synchronous local and non-local spatio-temporal learning and frequency attention. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp 1216–1221
Li C, Zhong Q, Xie D et al (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), pp 786–792
Ke Q, An S, Bennamoun M, Sohel F, Boussaid F (2017) SkeletonNet: mining deep part features for 3-D action recognition. IEEE Signal Process Lett 24(6):731–735
Liu J, Shahroudy A, Xu D, Kot AC, Wang G (2018) Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans Pattern Anal Mach Intell 40(12):3007–3021
Si C et al (2019) An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1227–1236
Shi L, Zhang Y, Cheng J et al (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 12026–12035
Wu C, Wu XJ, Kittler J (2019) Spatial residual layer and dense connection block enhanced spatial temporal graph convolutional network for skeleton-based action recognition. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW)
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems (NIPS), pp 568–576
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: ICLR
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR
He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 770–778
Szegedy C, Liu W, Jia Y et al (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1–9
Du Y, Fu Y, Wang L (2015) Skeleton based action recognition with convolutional neural network. In: 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp 579–583
Choutas V, Weinzaepfel P, Revaud J et al (2018) PoTion: pose motion representation for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 7024–7033
Ke Q, Bennamoun M, An S et al (2017) A new representation of skeleton sequences for 3D action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3288–3297
Li C, Zhong Q, Xie D et al (2017) Skeleton-based action recognition with convolutional neural networks. In: IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp 597–600
Liu M, Chen C, Liu H (2017) 3D action recognition using data visualization and convolutional neural networks. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pp 925–930
Wang P, Li W, Li C, Hou Y (2018) Action recognition based on joint trajectory maps with convolutional neural networks. Knowl-Based Syst 158:43–53
Wang P, Li Z, Hou Y et al (2016) Action recognition based on joint trajectory maps using convolutional neural networks. In: Proceedings of the 24th ACM International Conference on Multimedia, pp 102–106
Yang Z, Li Y, Yang J et al (2018) Action recognition with spatio-temporal visual attention on skeleton image sequences. IEEE Trans Circuits Syst Video Technol
Caetano C, Sena J, Brémond F et al (2019) SkeleMotion: a new representation of skeleton joint sequences based on motion information for 3D action recognition. In: 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp 1–8
Zhou L, Li W, Zhang Y et al (2014) Discriminative key pose extraction using extended LC-KSVD for action recognition. In: International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp 1–8
Wang J, Liu Z, Wu Y, Yuan J (2014) Learning actionlet ensemble for 3D human action recognition. IEEE Trans Pattern Anal Mach Intell 36(5):914–927
Vemulapalli R, Arrate F, Chellappa R (2014) Human action recognition by representing 3D skeletons as points in a Lie group. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 588–595
Li M, Chen S, Chen X et al (2019) Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 3595–3603
Caetano C, Brémond F, Schwartz WR (2019) Skeleton image representation for 3D action recognition based on tree structure and reference joints. In: 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp 16–23
Shi L, Zhang Y, Cheng J, Lu H (2019) Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 7912–7921
Liu J, Akhtar N, Mian A (2019) Skepxels: spatio-temporal image representation of human skeleton joints for action recognition. In: CVPR Workshops
Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In: European Conference on Computer Vision (ECCV), pp 816–833
Liu J, Wang G, Duan L-Y, Abdiyeva K, Kot AC (2017) Skeleton-based human action recognition with global context-aware attention LSTM networks. IEEE Trans Image Process 27(4):1586–1599
Ke Q, Bennamoun M, An S, Sohel F, Boussaid F (2018) Learning clip representations for skeleton-based 3D action recognition. IEEE Trans Image Process 27(6):2842–2855
Liu M, Yuan J (2018) Recognizing human actions as the evolution of pose estimation maps. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1159–1168
Dong J, Gao Y, Lee HJ, Zhou H, Yao Y, Fang Z, Huang B (2020) Action recognition based on the fusion of graph convolutional networks with high order features. Appl Sci 10(4):1482
Acknowledgments
This work was supported in part by the project of the Jilin Provincial Science and Technology Department under Grant 20180201003GX and by the project of the Jilin Province Development and Reform Commission under Grant 2019C053-4. The authors gratefully acknowledge the reviewers' helpful comments and suggestions, which have improved the presentation.
Funding
This research was funded by the project of the Jilin Provincial Science and Technology Department under Grant 20180201003GX; the article processing charge (APC) was also funded by Grant 20180201003GX.
Author information
Authors and Affiliations
Contributions
This study was completed jointly by the co-authors. Shuhua Liu conceived the research and wrote the draft. Xiaoying Bai and Ming Fang undertook the major experiments and analyses. Lanting Li was responsible for data processing and drawing the figures. Chih-Cheng Hung edited and reviewed the paper. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Liu, S., Bai, X., Fang, M. et al. Mixed graph convolution and residual transformation network for skeleton-based action recognition. Appl Intell 52, 1544–1555 (2022). https://doi.org/10.1007/s10489-021-02517-w