
STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition


Abstract

Video action recognition must model fine-grained differences in spatio-temporal features to distinguish similar actions. We propose the rethought spatio-temporal cross attention transformer (STAR++), a multi-modal transformer-based model that uses both RGB and skeleton information and extends the earlier STAR-Transformer. STAR++ unifies the encoder-decoder structure of the base spatio-temporal cross attention transformer (STAR-Transformer) into an encoder-only structure and introduces interval attention as a new form of spatio-temporal cross attention. Interval attention shifts from local to global features as the layers deepen, matching how transformers are known to learn and thereby improving performance. In addition, STAR++ introduces a deformable 3D token selection that dynamically selects the tokens entering each attention operation, so that tokens are learned efficiently. STAR++ demonstrated competitive performance against other state-of-the-art models on the Penn Action and NTU RGB+D 60/120 action recognition benchmarks. An ablation study further confirms that each proposed module contributes substantially to the performance improvement.
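To make the deformable 3D token selection idea concrete, below is a minimal PyTorch-style sketch, not the authors' released code: a lightweight head predicts a (t, h, w) offset for every spatio-temporal token, and the token grid is resampled at the deformed locations before attention, in the spirit of deformable attention [8, 9]. The class and parameter names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Deformable3DTokenSelection(nn.Module):
    """Sketch only: resample a (B, C, T, H, W) token grid at learned
    3D offsets so attention operates on dynamically selected tokens."""

    def __init__(self, dim: int):
        super().__init__()
        # One linear layer predicts a 3D offset per token; tanh keeps
        # offsets inside the normalized [-1, 1] sampling range.
        self.offset_head = nn.Linear(dim, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, T, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, T*H*W, C)
        # Offsets in grid_sample's (x=W, y=H, z=T) coordinate order.
        offsets = torch.tanh(self.offset_head(tokens))  # (B, T*H*W, 3)
        offsets = offsets.reshape(B, T, H, W, 3)

        # Identity sampling grid in normalized [-1, 1] coordinates.
        t = torch.linspace(-1.0, 1.0, T, device=x.device)
        h = torch.linspace(-1.0, 1.0, H, device=x.device)
        w = torch.linspace(-1.0, 1.0, W, device=x.device)
        base = torch.stack(torch.meshgrid(t, h, w, indexing="ij"), dim=-1)
        base = base.flip(-1)                            # (t,h,w) -> (w,h,t)
        base = base.unsqueeze(0).expand(B, -1, -1, -1, -1)

        # Trilinear resampling at the deformed locations; the offset
        # head is trained end-to-end through grid_sample.
        return F.grid_sample(x, base + offsets, mode="bilinear",
                             padding_mode="border", align_corners=True)

For example, Deformable3DTokenSelection(dim=256) applied to a tensor of shape (2, 256, 8, 14, 14) returns a deformed token grid of the same shape, which could then be flattened and fed to the interval attention layers.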



Availability of data and materials

The data that support the findings of this study are available on request from the corresponding authors of references [47,48,49].

Code Availability

Researchers or other interested parties are welcome to contact the corresponding author, B.C.K., for further explanation; the Python code may also be provided upon request.

References

  1. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inf Process Syst 27

  2. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941

  3. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308

  4. Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C (2021) Vivit: a video vision transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6836–6846

  5. Bertasius G, Wang H, Torresani L (2021) Is space-time attention all you need for video understanding? In: ICML, 2:4

  6. Liu Z, Ning J, Cao Y, Wei Y, Zhang Z, Lin S, Hu H (2022) Video swin transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3202–3211

  7. Wang J, Torresani L (2022) Deformable video transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14053–14062

  8. Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv:2010.04159

  9. Xia Z, Pan X, Song S, Li LE, Huang G (2022) Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4794–4803

  10. Wang J, Yang X, Li H, Liu L, Wu Z, Jiang Y-G (2022) Efficient video transformers with spatial-temporal token selection. In: Computer Vision–ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp 69–86

  11. Yin H, Vahdat A, Alvarez JM, Mallya A, Kautz J, Molchanov P (2022) A-vit: adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10809–10818

  12. Rao Y, Zhao W, Liu B, Lu J, Zhou J, Hsieh C-J (2021) Dynamicvit: efficient vision transformers with dynamic token sparsification. Adv Neural Inf Process Syst 34:13937–13949


  13. Ahn D, Kim S, Hong H, Ko BC (2023) Star-transformer: a spatio-temporal cross attention transformer for human action recognition. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3330–3339

  14. Zhu J, Zou W, Xu L, Hu Y, Zhu Z, Chang M, Huang J, Huang G, Du D (2018) Action machine: rethinking action recognition in trimmed videos. arXiv:1812.05770

  15. Baradel F, Wolf C, Mille J, Taylor GW (2018) Glimpse clouds: human activity recognition from unstructured feature points. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 469–478

  16. Wang Z, She Q, Smolic A (2021) Action-net: multipath excitation for action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13214–13223

  17. Liu X, Pintea SL, Nejadasl FK, Booij O, Van Gemert JC (2021) No frame left behind: full video action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14892–14901

  18. Feichtenhofer C (2020) X3d: expanding architectures for efficient video recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 203–213

  19. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6450–6459

  20. Neimark D, Bar O, Zohar M, Asselmann D (2021) Video transformer network. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3163–3172

  21. Xu M, Xiong Y, Chen H, Li X, Xia W, Tu Z, Soatto S (2021) Long short-term transformer for online action detection. Adv Neural Inf Process Syst 34:1086–1099


  22. Yan S, Xiong X, Arnab A, Lu Z, Zhang M, Sun C, Schmid C (2022) Multiview transformers for video recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3333–3343

  23. Yu B, Yin H, Zhu Z (2017) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv:1709.04875

  24. Zhang C, Li Q, Song D (2019) Aspect-based sentiment classification with aspect-specific graph convolutional networks. arXiv:1909.03477

  25. Cheng K, Zhang Y, He X, Chen W, Cheng J, Lu H (2020) Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 183–192

  26. Chen Y, Zhang Z, Yuan C, Li B, Deng Y, Hu W (2021) Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 13359–13368

  27. Chi H-g, Ha MH, Chi S, Lee SW, Huang Q, Ramani K (2022) Infogcn: representation learning for human skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20186–20196

  28. Yang D, Wang Y, Dantcheva A, Garattoni L, Francesca G, Bremond F (2021) Unik: a unified framework for real-world skeleton-based action recognition. arXiv:2107.08580

  29. Das S, Sharma S, Dai R, Bremond F, Thonnat M (2020) Vpn: learning video-pose embedding for activities of daily living. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part IX 16, pp 72–90. Springer

  30. Duan H, Zhao Y, Chen K, Lin D, Dai B (2022) Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2969–2978

  31. Joze HRV, Shaban A, Iuzzolino ML, Koishida K (2020) Mmtm: multimodal transfer module for cnn fusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13289–13299

  32. Munro J, Damen D (2020) Multi-modal domain adaptation for fine-grained action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 122–132

  33. Gao R, Oh T-H, Grauman K, Torresani L (2020) Listen to look: action recognition by previewing audio. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10457–10467

  34. Alamri H, Cartillier V, Das A, Wang J, Cherian A, Essa I, Batra D, Marks TK, Hori C, Anderson P et al (2019) Audio visual scene-aware dialog. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7558–7567

  35. Goyal P, Sahu S, Ghosh S, Lee C (2020) Cross-modal learning for multi-modal video categorization. arXiv:2003.03501

  36. Zhang B, Wang L, Wang Z, Qiao Y, Wang H (2016) Real-time action recognition with enhanced motion vector cnns. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2718–2726

  37. Yang L, Huang Y, Sugano Y, Sato Y (2022) Interact before align: leveraging cross-modal knowledge for domain adaptive action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14722–14732

  38. Alfasly S, Lu J, Xu C, Zou Y (2022) Learnable irrelevant modality dropout for multimodal action recognition on modality-specific annotated videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20208–20217

  39. Shi Z, Liang J, Li Q, Zheng H, Gu Z, Dong J, Zheng B (2021) Multi-modal multi-action video recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 13678–13687

  40. Miech A, Laptev I, Sivic J (2018) Learning a text-video embedding from incomplete and heterogeneous data. arXiv:1804.02516

  41. Ijaz M, Diaz R, Chen C (2022) Multimodal transformer for nursing activity recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2065–2074

  42. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022

  43. Jaderberg M, Simonyan K, Zisserman A et al (2015) Spatial transformer networks. Adv Neural Inf Process Syst 28

  44. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929

  45. Raghu M, Unterthiner T, Kornblith S, Zhang C, Dosovitskiy A (2021) Do vision transformers see like convolutional neural networks? Adv Neural Inf Process Syst 34:12116–12128


  46. Si C, Yu W, Zhou P, Zhou Y, Wang X, Yan S (2022) Inception transformer. arXiv:2205.12956

  47. Zhang W, Zhu M, Derpanis KG (2013) From actemes to action: a strongly-supervised representation for detailed action understanding. In: Proceedings of the IEEE international conference on computer vision, pp 2248–2255

  48. Shahroudy A, Liu J, Ng T-T, Wang G (2016) NTU RGB+D: a large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1010–1019

  49. Liu J, Shahroudy A, Perez M, Wang G, Duan L-Y, Kot AC (2019) NTU RGB+D 120: a large-scale benchmark for 3d human activity understanding. IEEE Trans Pattern Anal Mach Intell 42(10):2684–2701


  50. Soomro K, Zamir AR, Shah M (2012) Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402

  51. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) Hmdb: a large video database for human motion recognition. In: 2011 international conference on computer vision, pp 2556–2563. IEEE

  52. Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P et al (2017) The kinetics human action video dataset. arXiv:1705.06950

  53. Guo T, Liu H, Chen Z, Liu M, Wang T, Ding R (2022) Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. Proceedings of the AAAI conference on artificial intelligence 36:762–770

  54. Liu Y, Zhang H, Xu D, He K (2022) Graph transformer network with temporal kernel attention for skeleton-based action recognition. Knowl-Based Syst 240:108146


  55. Zeng A, Sun X, Yang L, Zhao N, Liu M, Xu Q (2021) Learning skeletal graph neural networks for hard 3d pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 11436–11445

  56. Liu M, Yuan J (2018) Recognizing human actions as the evolution of pose estimation maps. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1159–1168

  57. Chen T, Zhou D, Wang J, Wang S, Guan Y, He X, Ding E (2021) Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In: Proceedings of the 29th ACM international conference on multimedia, pp 4334–4342

  58. Bruce X, Liu Y, Chan KC (2021) Multimodal fusion via teacher-student network for indoor action recognition. Proceedings of the AAAI conference on artificial intelligence 35:3199–3207


  59. Cao C, Zhang Y, Zhang C, Lu H (2017) Body joint guided 3-d deep convolutional descriptors for action recognition. IEEE Trans Cybernet 48(3):1095–1108


  60. Luvizon DC, Picard D, Tabia H (2018) 2d/3d pose estimation and action recognition using multitask deep learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5137–5146

  61. Zhao R, Xu W, Su H, Ji Q (2019) Bayesian hierarchical dynamic model for human action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7733–7742

  62. Sun JJ, Zhao J, Chen L-C, Schroff F, Adam H, Liu T (2020) View-invariant probabilistic embedding for human pose. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part V 16, pp 53–70. Springer

  63. Hachiuma R, Sato F, Sekii T (2023) Unified keypoint-based action recognition framework via structured keypoint pooling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22962–22971

  64. Feichtenhofer C, Fan H, Malik J, He K (2019) Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6202–6211

  65. Duan H, Zhao Y, Xiong Y, Liu W, Lin D (2020) Omni-sourced webly-supervised learning for video recognition. In: European conference on computer vision, pp 670–688. Springer

  66. Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5693–5703

  67. Bruce X, Liu Y, Zhang X, Zhong S-h, Chan KC (2022) Mmnet: a model-based multimodal network for human action recognition in rgb-d videos. IEEE Trans Pattern Anal Mach Intell


Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2022R1I1A3058128).

Funding

This study was funded by the Basic Science Research Program through the National Research Foundation of Korea (NRF), Ministry of Education (2022R1I1A3058128).

Author information


Contributions

D.A. and S.K. were responsible for the design and overall investigation. B.C.K. was responsible for data curation, supervision, and writing and editing of the manuscript.

Corresponding author

Correspondence to Byoung Chul Ko.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Ethical approval

Not applicable

Consent to participate

Not applicable

Consent for publication

Not applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ahn, D., Kim, S. & Ko, B.C. STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition. Appl Intell 53, 28446–28459 (2023). https://doi.org/10.1007/s10489-023-04978-7

