Abstract
Action recognition techniques based on skeleton data are receiving increasing attention in computer vision because of their ability to adapt to dynamic environments and complex backgrounds. Modelling human skeleton data as spatial-temporal graphs and processing them with graph convolutional networks (GCNs) has been shown to produce good recognition results. However, existing GCN methods often use a fixed-size convolution kernel to extract temporal features, which may not suit multi-level model structures. In addition, fusing the streams of a multi-stream network in equal proportion may ignore differences in the recognition ability of the individual streams. Both issues can affect the final recognition result. In this paper, we propose (1) a multi-scale dilated temporal graph convolution layer (MDTGCL) and (2) a multi-branch feature fusion (MFF) structure. The MDTGCL uses multiple convolution kernels and dilated convolution to better adapt to the multi-layer structure of the GCN model and to capture spatial-temporal context over longer periods, resulting in richer behavioural features. MFF performs a weighted fusion of the multi-stream outputs to obtain the final recognition result. Because higher-order skeleton data are highly discriminative and more conducive to human action recognition, we jointly model spatial information on joints and bones, their multiple motion information, and angle information pertaining to bones. Combining the above, we design a multi-stream, multi-scale dilated spatial-temporal graph convolutional network (2M-STGCN) and conduct extensive experiments on two large datasets (NTU RGB+D 60 and Kinetics Skeleton 400), which show that our model performs at a state-of-the-art level.
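The two ideas named in the abstract, temporal convolution at several kernel sizes and dilation rates, and weighted rather than equal fusion of stream outputs, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the averaging kernels standing in for learned weights, the averaging of branch outputs, and the example stream weights are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of time steps covered by one dilated temporal kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Zero-padded 1D cross-correlation along time; output length equals len(x)."""
    k = len(kernel)
    span = (k - 1) * dilation              # distance between first and last tap
    pad = span // 2                        # left padding; the rest goes on the right
    padded = [0.0] * pad + list(x) + [0.0] * (span - pad)
    return [
        sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ]


def multi_scale_temporal(x: Sequence[float], branches: Sequence[Tuple[int, int]]) -> List[float]:
    """Run several (kernel_size, dilation) branches and average their outputs.

    Assumption: averaging kernels replace learned weights, and branch outputs
    are averaged; a real layer could concatenate or sum them instead.
    """
    outs = [dilated_conv1d(x, [1.0 / k] * k, d) for k, d in branches]
    return [sum(vals) / len(outs) for vals in zip(*outs)]


def fuse_scores(stream_scores: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted sum of per-class scores from each stream."""
    n_classes = len(next(iter(stream_scores.values())))
    fused = [0.0] * n_classes
    for name, scores in stream_scores.items():
        for i, s in enumerate(scores):
            fused[i] += weights[name] * s
    return fused


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    # A kernel of size 3 covers 3 frames undilated and 5 frames at dilation 2.
    print(receptive_field(3, 1), receptive_field(3, 2))

    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(multi_scale_temporal(signal, [(3, 1), (3, 2)]))

    # Hypothetical per-class scores from two streams.
    scores = {"joint": [0.7, 0.3], "bone": [0.4, 0.6]}
    equal = fuse_scores(scores, {"joint": 0.5, "bone": 0.5})
    weighted = fuse_scores(scores, {"joint": 0.3, "bone": 0.7})
    print(argmax(equal), argmax(weighted))
```

In this toy case equal-proportion fusion and weighted fusion select different classes, which is the kind of effect the weighting is meant to capture when one stream is more reliable than another. How the weights are actually chosen is not stated in the abstract.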






Additional information
Dongjin Yu, Liming Guan, Dongjing Wang, Conghao Ma and Zepeng Hu contributed equally to this work.
About this article
Cite this article
Zhang, H., Liu, X., Yu, D. et al. Skeleton-based action recognition with multi-stream, multi-scale dilated spatial-temporal graph convolution network. Appl Intell 53, 17629–17643 (2023). https://doi.org/10.1007/s10489-022-04365-8
DOI: https://doi.org/10.1007/s10489-022-04365-8