
Skeleton-based multi-stream adaptive-attentional sub-graph convolution network for action recognition

Published in: Multimedia Tools and Applications

Abstract

Recently, graph convolutional networks have achieved remarkable performance in skeleton-based action recognition. However, there are potential correlations between different parts of the human body: many studies ignore the fact that different actions result from the interaction of different body parts, and operating on the whole skeleton graph provides inadequate information to characterize the action category. In this study, to address this problem and further improve the accuracy of action recognition models, sub-graphs based on a depth-first tree traversal order were used to represent the importance of, and correlation between, joint and bone parts. In addition to the physical structure of the body, joint-motion and bone-motion information was introduced to represent how human body parts change during movement. To further improve performance, an adaptive-attentional mechanism was added to autonomously learn a unique topology for each sample and channel. The resulting multi-stream adaptive-attentional sub-graph convolution network achieved competitive results on the NTU RGB+D 60 dataset using either 2D or 3D skeleton poses, and the experimental results demonstrate the efficacy of the proposed method.
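The four input streams mentioned above (joint, bone, joint motion, and bone motion) are commonly derived directly from the raw skeleton sequence. The sketch below is a minimal illustration of one such derivation, not the authors' implementation; the (T, V, C) tensor layout and the parent list are assumptions loosely following a 25-joint NTU RGB+D-style skeleton.

```python
import numpy as np

# Hypothetical parent index for each joint in a 25-joint NTU-style skeleton;
# bone i is defined as joint i minus its parent joint.
PARENTS = [1, 20, 20, 2, 20, 4, 5, 6, 20, 8, 9, 10,
           0, 12, 13, 14, 0, 16, 17, 18, 1, 7, 7, 11, 11]

def build_streams(joints: np.ndarray) -> dict:
    """joints: (T, V, C) array of T frames, V joints, C coordinates (2D or 3D)."""
    # Bone stream: vector from each joint's parent to the joint itself.
    bones = joints - joints[:, PARENTS, :]

    # Motion streams: frame-to-frame differences (last frame padded with zeros).
    joint_motion = np.zeros_like(joints)
    joint_motion[:-1] = joints[1:] - joints[:-1]
    bone_motion = np.zeros_like(bones)
    bone_motion[:-1] = bones[1:] - bones[:-1]

    return {"joint": joints, "bone": bones,
            "joint_motion": joint_motion, "bone_motion": bone_motion}

if __name__ == "__main__":
    demo = np.random.randn(64, 25, 3)   # 64 frames, 25 joints, 3D pose
    for name, stream in build_streams(demo).items():
        print(name, stream.shape)       # each stream keeps shape (64, 25, 3)
```

Because every derived stream keeps the same tensor shape as the joint stream, the same sub-graph convolution backbone can typically be applied to each stream and the per-stream class scores fused at the end.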


Data availability

NTU RGB+D 60 (3D pose) can be obtained from https://github.com/shahroudy/NTURGB-D.

NTU RGB+D 60 (2D pose) can be obtained from https://github.com/kennymckormick/pyskl.


Acknowledgments

The authors thank the anonymous reviewers for their valuable comments. This work was supported by the National Key Research and Development Program of China (Grant No. 2022YFB2503405) and the Natural Science Foundation of Jilin Province (Grant No. 20210101061JC).

Author information

Corresponding author

Correspondence to Rui He.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, H., Wu, J., Ma, H. et al. Skeleton-based multi-stream adaptive-attentional sub-graph convolution network for action recognition. Multimed Tools Appl 83, 2935–2958 (2024). https://doi.org/10.1007/s11042-023-15778-z
