An attentional spatial temporal graph convolutional network with co-occurrence feature learning for action recognition

Published in: Multimedia Tools and Applications

Abstract

Action recognition plays a central role in intelligent surveillance systems, game control, human-computer interaction, and other applications. In this work, we design a multi-task framework that improves the recent Spatial-Temporal Graph Convolutional Networks (ST-GCN) for skeleton-based action recognition by introducing an attention mechanism and co-occurrence feature learning. Specifically, one branch applies attention to emphasize the most discriminative features, while the other aggregates co-occurrence features from all joints globally. Additionally, our multi-task framework exploits the inherent correlation between the two branches to further improve classification accuracy and convergence speed. Experiments have been carried out on the NTU RGB+D and Kinetics human action datasets. The results clearly show that the accuracy of the proposed multi-task framework is distinctly higher than that of ST-GCN and other mainstream methods for 3D action recognition.
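
The method details are not included in this preview, so the following is only a minimal sketch, in PyTorch, of how such a two-branch multi-task head could sit on top of an ST-GCN backbone. The feature shape (N, C, T, V), the squeeze-and-excitation-style channel attention, the 1x1 joint-mixing convolution, and all class and parameter names (MultiTaskHead, AttentionBranch, CooccurrenceBranch, joint_mix) are assumptions made for illustration, not the authors' implementation.

# Minimal sketch (NOT the authors' code): a two-branch multi-task head on top of
# ST-GCN features, assuming the backbone outputs a tensor of shape (N, C, T, V):
# batch, channels, frames, joints.
import torch
import torch.nn as nn


class AttentionBranch(nn.Module):
    """Channel attention (squeeze-and-excitation style) followed by a classifier."""

    def __init__(self, channels, num_classes, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x):                        # x: (N, C, T, V)
        squeezed = x.mean(dim=(2, 3))            # global average over time and joints
        weights = self.gate(squeezed)            # per-channel attention weights in (0, 1)
        attended = x * weights[:, :, None, None]
        return self.classifier(attended.mean(dim=(2, 3)))


class CooccurrenceBranch(nn.Module):
    """Aggregates features from all joints globally before classification."""

    def __init__(self, channels, num_joints, num_classes):
        super().__init__()
        # Treating the joint axis as "channels" and mixing it with a 1x1
        # convolution lets every output unit see all joints at once.
        self.joint_mix = nn.Conv2d(num_joints, num_joints, kernel_size=1)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x):                        # x: (N, C, T, V)
        x = self.joint_mix(x.permute(0, 3, 2, 1))    # (N, V, T, C)
        return self.classifier(x.mean(dim=(1, 2)))   # pool over joints and time


class MultiTaskHead(nn.Module):
    """Shares backbone features between the two branches and sums their losses."""

    def __init__(self, channels=256, num_joints=25, num_classes=60):
        super().__init__()
        self.attn = AttentionBranch(channels, num_classes)
        self.cooc = CooccurrenceBranch(channels, num_joints, num_classes)

    def forward(self, feats, labels=None):
        logits_a, logits_c = self.attn(feats), self.cooc(feats)
        if labels is None:
            return logits_a + logits_c           # fuse branch scores at inference time
        ce = nn.CrossEntropyLoss()
        return ce(logits_a, labels) + ce(logits_c, labels)


if __name__ == "__main__":
    feats = torch.randn(8, 256, 75, 25)          # 8 clips, 75 frames, 25 joints (NTU-style)
    labels = torch.randint(0, 60, (8,))
    print(MultiTaskHead()(feats, labels))        # scalar joint multi-task loss

Summing the two cross-entropy terms is the simplest way to couple the branches during training; the paper's actual loss weighting and score fusion may differ.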

References

  1. Baradel F, Wolf C, Mille J (2017) Pose-conditioned spatiotemporal attention for human action recognition. arXiv preprint arXiv:1703.10106

  2. Cao Z, Simon T, Wei S-E, Sheikh Y (2017) Realtime multi-person 2d pose estimation using part affinity fields. In: Computer Vision and Pattern Recognition (CVPR)

  3. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR

  4. Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1110–1118

  5. Gu J, Wang G, Chen T (2016) Recurrent highway networks with language cnn for image captioning. arXiv preprint arXiv:1612.07086

  6. Hammond DK, Vandergheynst P, Gribonval R (2011) Wavelets on graphs via spectral graph theory. Appl Comput Harmon Anal 30(2):129–150

  7. Hu J, Shen L, Albanie S (2017) Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell PP(99):1–1

  8. Jin SY, Choi HJ (2012) Essential body-joint and atomic action detection for human activity recognition using longest common subsequence algorithm. In: ICCV, pp 148–159

  9. Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P et al (2017) The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950

  10. Ke Q, An S, Bennamoun M, Sohel F, Boussaid F (2017) SkeletonNet: mining deep part features for 3D action recognition. IEEE Signal Processing Letters

  11. Ke Q, Bennamoun M, An S, Sohel F, Boussaid F (2017) A new representation of skeleton sequences for 3D action recognition. In: CVPR, July 2017

  12. Kim TS, Reiter A (2017) Interpretable 3d human action analysis with temporal convolutional networks. In: BNMW CVPRW

  13. Koniusz P, Cherian A, Porikli F (2016) Tensor representations via kernel linearization for action recognition from 3d skeletons. arXiv preprint arXiv:1604.00239

  14. Li D, Chen X, Zhang Z, Huang K (2017) Learning deep context-aware features over body and latent parts for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 384–393

  15. Li C, Zhong Q, Xie D, Pu S (2017) Skeleton-based action recognition with convolutional neural networks. arXiv preprint arXiv:1704.07595

  16. Li W, Zhu X, Gong S (2018) Harmonious attention network for person reidentification. In: CVPR, vol 1, p 2

  17. Li R, Wang S, Zhu F, Huang J (2018) Adaptive graph convolutional neural networks. arXiv preprint arXiv:1801.03226

  18. Li C, Zhong Q, Xie D, Pu S (2018) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055

  19. Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In: European conference on computer vision (ECCV). Springer, pp 816–833

  20. Lu G, Zhou Y, Li X (2016) Efficient action recognition via local position offset of 3D skeletal body joints. Multimed Tools Appl 75(6):3479–3494

  21. Nguyen TV (2015) STAP: spatial-temporal attention-aware pooling for action recognition. IEEE Trans Circuits Syst Video Technol 25(1):77–86

  22. Niepert M, Ahmed M, Kutzkov K (2016) Learning convolutional neural networks for graphs. In: International conference on machine learning

  23. Sainath TN, Vinyals O, Senior A, Sak H (2015) Convolutional, long short-term memory, fully connected deep neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 4580–4584

  24. Shahroudy A, Liu J, Ng T-T, Wang G (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In: CVPR, pp 1010–1019

  25. Si C, Chen W, Wang W, Wang L, Tan T (2019) An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In: CVPR 2019

  26. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: NIPS, pp 568–576

  27. Song S, Lan C, Xing J, Zeng W, Liu J (2017) An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Proceedings of the thirty-first AAAI conference on artificial intelligence, February 4–9, 2017, San Francisco, California, USA, pp 4263–4270

  28. Sun B, Kong D, Wang S (2018) Effective human action recognition using global and local offsets of skeleton joints. Multimed Tools Appl:1–25. Published online Jul, 2018

  29. Toshev A, Szegedy C (2013) DeepPose: human pose estimation via deep neural networks. arXiv preprint arXiv:1312.4659

  30. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: IEEE International Conference on Computer Vision (ICCV)

  31. Wang H et al (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103(1):60–79

  32. Wang J, Liu Z, Wu Y, Yuan J (2014) Learning actionlet ensemble for 3d human action recognition. TPAMI 36(5):914

  33. Wang L, Xiong Y, Wang Z, Qiao Y, Lin D, Tang X, Van Gool L (2016) Temporal segment networks: towards good practices for deep action recognition. In: ECCV, 2016, p 6

  34. Wang C, Zhang Q, Huang C, Liu W, Wang X (2018) Mancs: a multi-task attentional network with curriculum sampling for person re-identification. In: ECCV 2018, pp 384–400

  35. Weston J, Chopra S, Bordes A (2014) Memory networks. arXiv preprint arXiv:1410.3916

  36. Xia L, Chen C-C, Aggarwal J (2012) View invariant human action recognition using histograms of 3D joints. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 20–27

  37. Xu K, Li C, Tian Y, Sonobe T, Kawarabayashi KI, Jegelka S (2018) Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536

  38. Yan S, Xiong Y, Lin D (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI

  39. Yeung S, Russakovsky O, Jin N, Andriluka M, Mori G, Fei-Fei L (2015) Every moment counts: dense detailed labeling of actions in complex videos. Int J Comput Vis 126(2–4):375–389

  40. Yong D, Yun F, Liang W (2016) Skeleton based action recognition with convolutional neural network. In: Pattern Recognition, pp 579–583

  41. Yu Y, Mann GK, Gosine RG (2010) An object-based visual attention model for robotic applications. IEEE Trans Syst Man Cybern B Cybern 40(5):1398–1412

  42. Zhao Y, Xiong Y, Wang L, Wu Z, Tang X, Lin D (2017) Temporal action detection with structured segment networks. In: ICCV

  43. Zhu W, Lan C, Xing J, Zeng W, Li Y, Shen L, Xie X (2016) Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. In: AAAI Conference on Artificial Intelligence (AAAI)

  44. Zichao M, Zhixin S (2018) Time-varying LSTM networks for action recognition. Multimed Tools Appl:32275–32285. Published online Dec. 2018

Author information

Corresponding author

Correspondence to Zhe-Ming Lu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tian, D., Lu, ZM., Chen, X. et al. An attentional spatial temporal graph convolutional network with co-occurrence feature learning for action recognition. Multimed Tools Appl 79, 12679–12697 (2020). https://doi.org/10.1007/s11042-020-08611-4
