
Graph-aware transformer for skeleton-based action recognition

Original article · The Visual Computer

Abstract

Recently, graph convolutional networks (GCNs) have come to play a critical role in skeleton-based human action recognition. However, most GCN-based methods still have two main limitations: (1) the semantic-level adjacency matrix of the skeleton graph is difficult to define manually, which restricts the receptive field of the GCN and limits its ability to extract spatial–temporal features; (2) the velocity information of human body joints cannot be efficiently or fully exploited by the GCN, because it does not explicitly represent the correlations between velocity vectors. To address these issues, we propose a graph-aware transformer (GAT), which makes full use of the velocity information and learns discriminative spatial–temporal motion features from the sequence of skeleton graphs in a data-driven way. In addition, like GCN-based models, our GAT incorporates prior structures of the human body, including a link-aware structure and a part-aware structure. Extensive experiments on three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-Skeleton, demonstrate that the proposed GAT obtains significant improvements over GCN-based baselines for skeleton action recognition.
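To make the abstract's two ideas concrete, the following is a minimal PyTorch sketch, not the authors' implementation: joint velocities derived as frame-to-frame coordinate differences, and a single-head spatial self-attention layer whose attention map plays the role of a learned, data-driven adjacency, softly biased by a fixed bone-link mask standing in for the paper's link-aware prior. The module name, the alpha blending parameter, and the tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def joint_velocities(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, frames, joints, channels) joint coordinates.
    Returns first-order temporal differences, zero-padded at the first frame."""
    v = x[:, 1:] - x[:, :-1]
    return F.pad(v, (0, 0, 0, 0, 1, 0))  # pad the frame axis at the front


class LinkAwareAttention(nn.Module):
    """Single-head self-attention over the joint axis. The softmax map acts as
    a data-driven adjacency; a learned scalar blends in a fixed bone-link mask
    (the structural prior) instead of a hand-designed adjacency matrix."""

    def __init__(self, dim: int, link_mask: torch.Tensor):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # (joints, joints) binary mask: 1 where two joints share a bone
        self.register_buffer("link_bias", link_mask.float())
        self.alpha = nn.Parameter(torch.zeros(1))  # learned trust in the prior

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, joints, dim); attention runs across joints
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        logits = logits + self.alpha * self.link_bias  # inject skeletal links
        attn = logits.softmax(dim=-1)  # learned adjacency, one row per joint
        return self.proj(attn @ v)
```

A full model would stack such spatial blocks with temporal attention or convolutions and feed both the coordinate and velocity streams; those architectural choices belong to the paper and are not reproduced here.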



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 62106177 and by the Joint Fund of the Ministry of Education of China under Grant No. 8091B032156. Numerical calculations were performed on the supercomputing system at the Supercomputing Center of Wuhan University.

Author information

Correspondence to Chao Wang or Ruide Tu.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, J., Xie, W., Wang, C. et al. Graph-aware transformer for skeleton-based action recognition. Vis Comput 39, 4501–4512 (2023). https://doi.org/10.1007/s00371-022-02603-1

