Skeleton-based action recognition with multi-stream, multi-scale dilated spatial-temporal graph convolution network

Zhang, Haiping; Liu, Xu; Yu, Dongjin; Guan, Liming; Wang, Dongjing; Ma, Conghao; Hu, Zepeng

doi:10.1007/s10489-022-04365-8

Published: 10 January 2023

Volume 53, pages 17629–17643, (2023)

Applied Intelligence

Haiping Zhang¹,²,
Xu Liu³ (ORCID: orcid.org/0000-0002-0107-3119),
Dongjin Yu¹,
Liming Guan²,
Dongjing Wang¹,
Conghao Ma³ &
Zepeng Hu¹

Abstract

Action recognition techniques based on skeleton data are receiving increasing attention in computer vision because they adapt well to dynamic environments and complex backgrounds. Representing human skeleton data as spatial-temporal graphs and processing them with graph convolutional networks (GCNs) has been shown to produce good recognition results. However, existing GCN methods often use a fixed-size convolution kernel to extract temporal features, which may not suit multi-level model structures. In addition, fusing the streams of a multi-stream network in equal proportion may ignore differences in the recognition ability of individual streams. Both issues affect the final recognition result. In this paper, we propose (1) a multi-scale dilated temporal graph convolution layer (MDTGCL) and (2) a multi-branch feature fusion (MFF) structure. The MDTGCL uses multiple convolution kernels and dilated convolution to better adapt to the multi-layer structure of the GCN model and to capture contextual spatial-temporal information over longer periods, yielding richer behavioural features. MFF performs a weighted fusion of the multi-stream outputs to obtain the final recognition result. Because higher-order skeleton data are highly discriminative and more conducive to human action recognition, we jointly model spatial information on joints and bones, their multiple motion information, and angle information pertaining to bones. Combining the above, we designed a multi-stream, multi-scale dilated spatial-temporal graph convolutional network (2M-STGCN) and conducted extensive experiments on two large datasets (NTU RGB+D 60 and Kinetics Skeleton 400), which showed that our model performs at a state-of-the-art (SOTA) level.
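
The two ideas named in the abstract, dilated temporal convolution at several scales and weighted fusion of stream outputs, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy signal, the class scores and the fusion weights are all hypothetical, and the abstract does not state how the fusion weights are obtained. The sketch only shows that a kernel of size k with dilation d spans d(k-1)+1 frames, and that weighting streams unequally can change the fused prediction.

```python
from typing import List, Sequence


def receptive_field(kernel_size: int, dilation: int) -> int:
    """Number of time steps spanned by one dilated 1-D kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Zero-padded, stride-1 temporal convolution (cross-correlation form)
    that keeps the output the same length as the input."""
    span = dilation * (len(kernel) - 1)
    left = span // 2
    padded = [0.0] * left + list(x) + [0.0] * (span - left)
    return [
        sum(w * padded[t + j * dilation] for j, w in enumerate(kernel))
        for t in range(len(x))
    ]


def weighted_fusion(stream_scores: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-stream class scores (one weight per stream)."""
    if len(stream_scores) != len(weights):
        raise ValueError("need exactly one weight per stream")
    return [
        sum(w * scores[c] for w, scores in zip(weights, stream_scores))
        for c in range(len(stream_scores[0]))
    ]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy feature of one joint over five frames (made-up numbers).
signal = [1, 2, 3, 4, 5]
box = [1, 1, 1]
narrow = dilated_conv1d(signal, box, dilation=1)  # each output sees 3 frames
wide = dilated_conv1d(signal, box, dilation=2)    # each output sees 5 frames

# Made-up class scores for three streams (joint, bone, motion), two classes.
scores = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
equal = weighted_fusion(scores, [1.0, 1.0, 1.0])     # equal-proportion fusion
weighted = weighted_fusion(scores, [0.2, 0.4, 0.4])  # hypothetical weights
```

With these made-up numbers, equal-proportion fusion selects class 0 while the weighted fusion selects class 1, which illustrates why the choice of fusion weights can matter. A real model would apply such temporal kernels per channel inside a deep-learning framework and would combine the branch outputs in whatever way the full paper specifies.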

Author information

Authors and Affiliations

School of Computer Science, Hangzhou Dianzi University, Qiantang, Hangzhou, 310018, Zhejiang, China
Haiping Zhang, Dongjin Yu, Dongjing Wang & Zepeng Hu
School of Information Engineering, Hangzhou Dianzi University, Qiantang, Hangzhou, 310018, Zhejiang, China
Haiping Zhang & Liming Guan
School of Electronics and Information, Hangzhou Dianzi University, Qiantang, Hangzhou, 310018, Zhejiang, China
Xu Liu & Conghao Ma

Corresponding authors

Correspondence to Haiping Zhang or Xu Liu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Dongjin Yu, Liming Guan, Dongjing Wang, Conghao Ma and Zepeng Hu contributed equally to this work.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, H., Liu, X., Yu, D. et al. Skeleton-based action recognition with multi-stream, multi-scale dilated spatial-temporal graph convolution network. Appl Intell 53, 17629–17643 (2023). https://doi.org/10.1007/s10489-022-04365-8

Accepted: 25 November 2022
Published: 10 January 2023
Issue Date: July 2023
DOI: https://doi.org/10.1007/s10489-022-04365-8
