
Graph-aware transformer for skeleton-based action recognition

Original article · The Visual Computer

Abstract

Recently, graph convolutional networks (GCNs) have come to play a critical role in skeleton-based human action recognition. However, most GCN-based methods still have two main limitations: (1) the semantic-level adjacency matrix of the skeleton graph is difficult to define manually, which restricts the receptive field of the GCN and limits its ability to extract spatial–temporal features; (2) the velocity information of human body joints cannot be efficiently or fully exploited by the GCN, because it does not explicitly represent the correlations between velocity vectors. To address these issues, we propose a graph-aware transformer (GAT), which makes full use of the velocity information and learns discriminative spatial–temporal motion features from the sequence of skeleton graphs in a data-driven way. In addition, like GCN-based models, our GAT incorporates prior structures of the human body, including a link-aware structure and a part-aware structure. Extensive experiments on three large-scale datasets, i.e., NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics-Skeleton, demonstrate that the proposed GAT obtains significant improvements over GCN-based baselines for skeleton action recognition.
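To make the abstract's two ideas concrete, the following is a minimal PyTorch sketch, not the authors' implementation: joint velocities derived as frame-to-frame coordinate differences, and a single-head spatial self-attention layer whose attention map plays the role of a learned, data-driven adjacency, softly biased by a fixed bone-link mask standing in for the paper's link-aware prior. The module name, the alpha blending parameter, and the tensor layout are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def joint_velocities(x: torch.Tensor) -> torch.Tensor:
    """x: (batch, frames, joints, channels) joint coordinates.
    Returns first-order temporal differences, zero-padded at the first frame."""
    v = x[:, 1:] - x[:, :-1]
    return F.pad(v, (0, 0, 0, 0, 1, 0))  # pad the frame axis at the front


class LinkAwareAttention(nn.Module):
    """Single-head self-attention over the joint axis. The softmax map acts as
    a data-driven adjacency; a learned scalar blends in a fixed bone-link mask
    (the structural prior) instead of a hand-designed adjacency matrix."""

    def __init__(self, dim: int, link_mask: torch.Tensor):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # (joints, joints) binary mask: 1 where two joints share a bone
        self.register_buffer("link_bias", link_mask.float())
        self.alpha = nn.Parameter(torch.zeros(1))  # learned trust in the prior

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, joints, dim); attention runs across joints
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        logits = logits + self.alpha * self.link_bias  # inject skeletal links
        attn = logits.softmax(dim=-1)  # learned adjacency, one row per joint
        return self.proj(attn @ v)
```

A full model would stack such spatial blocks with temporal attention or convolutions and feed both the coordinate and velocity streams; those architectural choices belong to the paper and are not reproduced here.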



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 62106177 and by the Joint Fund of the Ministry of Education of China under Grant No. 8091B032156. Numerical calculations were performed on the supercomputing system at the Supercomputing Center of Wuhan University.

Author information

Correspondence to Chao Wang or Ruide Tu.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, J., Xie, W., Wang, C. et al. Graph-aware transformer for skeleton-based action recognition. Vis Comput 39, 4501–4512 (2023). https://doi.org/10.1007/s00371-022-02603-1

