
View transform graph attention recurrent networks for skeleton-based action recognition

Original Paper · Signal, Image and Video Processing

Abstract

Skeleton-based human action recognition has recently attracted considerable research attention due to the accessibility and popularity of 3D skeleton data. However, effectively representing spatial–temporal skeleton sequences is difficult because action representations vary greatly when captured from different viewpoints. To obtain a better representation of spatial–temporal skeletal features, this paper introduces a view-transform graph attention recurrent network (VT+GARN) for view-invariant human action recognition. We design a sequence-based view-invariant transform strategy that reduces the influence of viewpoint on the spatial–temporal positions of skeleton joints. A graph attention recurrent network then automatically computes attention coefficients on the transformed sequence, learns a representation of the spatiotemporal skeletal features, and outputs the classification result. Ablation studies and extensive experiments on three challenging datasets, Northwestern-UCLA, NTU RGB+D and UWA3DII, demonstrate the effectiveness and superiority of our method.
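To make these two stages concrete, the sketch below pairs a common sequence-level view normalization (translating the root joint to the origin and rotating the body into a canonical orientation) with a single graph-attention layer in the style of Veličković et al.'s GAT. This is a minimal illustration under stated assumptions, not the paper's implementation: the joint indices, layer sizes, and the choice of rotating about the vertical axis are placeholders, and VT+GARN's recurrent component and classifier are omitted.

```python
# Hypothetical sketch, not the authors' code: joint indices, layer sizes and
# the vertical-axis (z) rotation are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def view_normalize(seq, hip=0, l_hip=1, r_hip=2):
    """Map a skeleton sequence (T, J, 3) into a canonical view.

    Translates the hip joint of the first frame to the origin, then rotates
    about the z-axis so the left-hip -> right-hip vector of the first frame
    aligns with the x-axis; the same action captured from different
    viewpoints then yields similar joint coordinates.
    """
    seq = seq - seq[0, hip]                # root joint of frame 0 at the origin
    v = seq[0, r_hip] - seq[0, l_hip]      # across-body direction in frame 0
    theta = np.arctan2(v[1], v[0])         # its angle in the x-y plane
    c, s = np.cos(-theta), np.sin(-theta)  # rotate by -theta about z
    R = np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]], dtype=seq.dtype)
    return seq @ R.T                       # apply R to every joint, every frame

class GraphAttentionLayer(nn.Module):
    """One GAT-style layer: attention coefficients are computed from pairs of
    transformed joint features and normalized over each joint's neighbours."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared feature transform
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention scoring vector

    def forward(self, h, adj):
        # h: (J, in_dim) joint features; adj: (J, J) skeleton adjacency
        # (adj must include self-loops so every row has a neighbour).
        z = self.W(h)                                              # (J, out_dim)
        J = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(-1, J, -1),       # features of i
                           z.unsqueeze(0).expand(J, -1, -1)], -1)  # features of j
        e = F.leaky_relu(self.a(pairs).squeeze(-1), 0.2)           # raw scores (J, J)
        e = e.masked_fill(adj == 0, float('-inf'))                 # keep edges only
        alpha = torch.softmax(e, dim=-1)                           # attention coefficients
        return F.elu(alpha @ z)                                    # aggregate neighbours

# Toy usage: 30 frames of a 25-joint skeleton, chain adjacency with self-loops.
seq = view_normalize(np.random.randn(30, 25, 3).astype(np.float32))
adj = torch.eye(25) + torch.diag(torch.ones(24), 1) + torch.diag(torch.ones(24), -1)
feats = GraphAttentionLayer(3, 16)(torch.from_numpy(seq[0]), adj)  # (25, 16)
```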





Acknowledgements

This project is supported by the National Key R&D Program of China (Grant No. 2017YFB1302400), the National Natural Science Foundation of China (Grant Nos. 61773242, 61803227 and 61375084), the Major Agricultural Applied Technological Innovation Projects of Shandong Province (SD2019NJ014), the Shandong Natural Science Foundation (ZR2019MF064) and the Intelligent Robot and System Innovation Center Foundation (2019IRS19).

Author information


Corresponding author

Correspondence to Fengyu Zhou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Huang, Q., Zhou, F., Qin, R. et al. View transform graph attention recurrent networks for skeleton-based action recognition. SIViP 15, 599–606 (2021). https://doi.org/10.1007/s11760-020-01781-6
