Skeleton-based action recognition with multi-stream, multi-scale dilated spatial-temporal graph convolution network

Applied Intelligence

Abstract

Skeleton-based action recognition is receiving increasing attention in computer vision because it adapts well to dynamic environments and complex backgrounds. Representing human skeleton data as spatial-temporal graphs and processing them with graph convolutional networks (GCNs) has been shown to produce good recognition results. However, existing GCN methods often use a fixed-size convolution kernel to extract temporal features, which may not suit multi-level model structures. In addition, fusing the streams of a multi-stream network in equal proportion may ignore differences in the recognition ability of the individual streams. Both issues affect the final recognition result. In this paper, we propose (1) a multi-scale dilated temporal graph convolution layer (MDTGCL) and (2) a multi-branch feature fusion (MFF) structure. The MDTGCL uses multiple convolution kernels and dilated convolution to better adapt to the multi-layer structure of the GCN model and to capture spatial-temporal context over longer periods, resulting in richer behavioural features. MFF performs a weighted fusion of the multi-stream outputs to obtain the final recognition result. Because higher-order skeleton data are highly discriminative and more conducive to human action recognition, we jointly model spatial information on joints and bones, their multiple motion information, and angle information pertaining to bones. Combining these components, we design a multi-stream, multi-scale dilated spatial-temporal graph convolutional network (2M-STGCN) and conduct extensive experiments on two large datasets (NTU RGB+D 60 and Kinetics Skeleton 400), which show that our model performs at a state-of-the-art (SOTA) level.
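The two ideas in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the kernel sizes, dilation rates, stream names, scores and fusion weights are hypothetical, and the functions `receptive_field` and `weighted_fusion` are not taken from the paper. The first function shows why dilation lets a temporal kernel cover a longer span of frames; the second shows how a weighted sum of per-stream class scores can give a different decision than an equal-proportion sum.

```python
from typing import List, Sequence


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of frames spanned by one 1-D temporal convolution."""
    return dilation * (kernel_size - 1) + 1


def weighted_fusion(stream_scores: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Fuse per-stream class scores with one weight per stream."""
    if len(stream_scores) != len(weights):
        raise ValueError("need exactly one weight per stream")
    num_classes = len(stream_scores[0])
    fused = [0.0] * num_classes
    for scores, weight in zip(stream_scores, weights):
        for cls, score in enumerate(scores):
            fused[cls] += weight * score
    return fused


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical class scores from two streams for a two-class problem.
joint_scores = [0.6, 0.4]
bone_scores = [0.3, 0.7]

# Equal-proportion fusion versus a fusion that trusts the joint stream more.
equal = weighted_fusion([joint_scores, bone_scores], [0.5, 0.5])
weighted = weighted_fusion([joint_scores, bone_scores], [0.8, 0.2])

print(receptive_field(3, 1), receptive_field(3, 2), receptive_field(5, 2))
print(argmax(equal), argmax(weighted))
```

With these made-up numbers the equal-proportion fusion and the weighted fusion select different classes, which is the kind of effect the MFF structure is meant to exploit. How the weights are actually obtained in 2M-STGCN is not stated in the abstract.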

Author information

Corresponding authors

Correspondence to Haiping Zhang or Xu Liu.

Additional information

Dongjin Yu, Liming Guan, Dongjing Wang, Conghao Ma and Zepeng Hu contributed equally to this work.

About this article

Cite this article

Zhang, H., Liu, X., Yu, D. et al. Skeleton-based action recognition with multi-stream, multi-scale dilated spatial-temporal graph convolution network. Appl Intell 53, 17629–17643 (2023). https://doi.org/10.1007/s10489-022-04365-8

  • DOI: https://doi.org/10.1007/s10489-022-04365-8
