
STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition


Abstract

Video action recognition must model fine-grained differences in spatio-temporal features to distinguish similar actions. We propose the rethought spatio-temporal cross attention transformer (STAR++), a multi-modal transformer-based model that uses both RGB and skeleton information and extends the earlier STAR-Transformer. STAR++ unifies the encoder-decoder structure of the base spatio-temporal cross attention transformer (STAR-Transformer) into an encoder-only structure and introduces interval attention as a new form of spatio-temporal cross attention. Interval attention shifts from local to global features as the layers deepen, matching how transformers are known to learn and thereby improving performance. In addition, STAR++ introduces a deformable 3D token selection that dynamically selects the tokens entering each attention operation, so that tokens are learned efficiently. STAR++ demonstrated competitive performance against other state-of-the-art models on the Penn Action and NTU RGB+D 60/120 action recognition benchmarks. An ablation study further confirms that each proposed module contributes substantially to the performance improvement.
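To make the deformable 3D token selection idea concrete, below is a minimal PyTorch-style sketch, not the authors' released code: a lightweight head predicts a (t, h, w) offset for every spatio-temporal token, and the token grid is resampled at the deformed locations before attention, in the spirit of deformable attention [8, 9]. The class and parameter names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Deformable3DTokenSelection(nn.Module):
    """Sketch only: resample a (B, C, T, H, W) token grid at learned
    3D offsets so attention operates on dynamically selected tokens."""

    def __init__(self, dim: int):
        super().__init__()
        # One linear layer predicts a 3D offset per token; tanh keeps
        # offsets inside the normalized [-1, 1] sampling range.
        self.offset_head = nn.Linear(dim, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, T, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, T*H*W, C)
        # Offsets in grid_sample's (x=W, y=H, z=T) coordinate order.
        offsets = torch.tanh(self.offset_head(tokens))  # (B, T*H*W, 3)
        offsets = offsets.reshape(B, T, H, W, 3)

        # Identity sampling grid in normalized [-1, 1] coordinates.
        t = torch.linspace(-1.0, 1.0, T, device=x.device)
        h = torch.linspace(-1.0, 1.0, H, device=x.device)
        w = torch.linspace(-1.0, 1.0, W, device=x.device)
        base = torch.stack(torch.meshgrid(t, h, w, indexing="ij"), dim=-1)
        base = base.flip(-1)                            # (t,h,w) -> (w,h,t)
        base = base.unsqueeze(0).expand(B, -1, -1, -1, -1)

        # Trilinear resampling at the deformed locations; the offset
        # head is trained end-to-end through grid_sample.
        return F.grid_sample(x, base + offsets, mode="bilinear",
                             padding_mode="border", align_corners=True)

For example, Deformable3DTokenSelection(dim=256) applied to a tensor of shape (2, 256, 8, 14, 14) returns a deformed token grid of the same shape, which could then be flattened and fed to the interval attention layers.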



Availability of data and materials

The data that support the findings of this study are available on request from the corresponding authors of references [47,48,49].

Code Availability

Researchers or other interested parties are welcome to contact the corresponding author, B.C.K., for further explanation; the Python code may also be provided upon request.

References

  1. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inf Process Syst 27

  2. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941

  3. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308

  4. Arnab A, Dehghani M, Heigold G, Sun C, Lučić M, Schmid C (2021) Vivit: a video vision transformer. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6836–6846

  5. Bertasius G, Wang H, Torresani L (2021) Is space-time attention all you need for video understanding? In: ICML, 2:4

  6. Liu Z, Ning J, Cao Y, Wei Y, Zhang Z, Lin S, Hu H (2022) Video swin transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3202–3211

  7. Wang J, Torresani L (2022) Deformable video transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14053–14062

  8. Zhu X, Su W, Lu L, Li B, Wang X, Dai J (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv:2010.04159

  9. Xia Z, Pan X, Song S, Li LE, Huang G (2022) Vision transformer with deformable attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4794–4803

  10. Wang J, Yang X, Li H, Liu L, Wu Z, Jiang Y-G (2022) Efficient video transformers with spatial-temporal token selection. In: Computer Vision–ECCV 2022: 17th European conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, pp 69–86

  11. Yin H, Vahdat A, Alvarez JM, Mallya A, Kautz J, Molchanov P (2022) A-vit: adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10809–10818

  12. Rao Y, Zhao W, Liu B, Lu J, Zhou J, Hsieh C-J (2021) Dynamicvit: efficient vision transformers with dynamic token sparsification. Adv Neural Inf Process Syst 34:13937–13949


  13. Ahn D, Kim S, Hong H, Ko BC (2023) Star-transformer: a spatio-temporal cross attention transformer for human action recognition. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3330–3339

  14. Zhu J, Zou W, Xu L, Hu Y, Zhu Z, Chang M, Huang J, Huang G, Du D (2018) Action machine: rethinking action recognition in trimmed videos. arXiv:1812.05770

  15. Baradel F, Wolf C, Mille J, Taylor GW (2018) Glimpse clouds: human activity recognition from unstructured feature points. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 469–478

  16. Wang Z, She Q, Smolic A (2021) Action-net: multipath excitation for action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13214–13223

  17. Liu X, Pintea SL, Nejadasl FK, Booij O, Van Gemert JC (2021) No frame left behind: full video action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14892–14901

  18. Feichtenhofer C (2020) X3d: expanding architectures for efficient video recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 203–213

  19. Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M (2018) A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6450–6459

  20. Neimark D, Bar O, Zohar M, Asselmann D (2021) Video transformer network. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3163–3172

  21. Xu M, Xiong Y, Chen H, Li X, Xia W, Tu Z, Soatto S (2021) Long short-term transformer for online action detection. Adv Neural Inf Process Syst 34:1086–1099


  22. Yan S, Xiong X, Arnab A, Lu Z, Zhang M, Sun C, Schmid C (2022) Multiview transformers for video recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3333–3343

  23. Yu B, Yin H, Zhu Z (2017) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv:1709.04875

  24. Zhang C, Li Q, Song D (2019) Aspect-based sentiment classification with aspect-specific graph convolutional networks. arXiv:1909.03477

  25. Cheng K, Zhang Y, He X, Chen W, Cheng J, Lu H (2020) Skeleton-based action recognition with shift graph convolutional network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 183–192

  26. Chen Y, Zhang Z, Yuan C, Li B, Deng Y, Hu W (2021) Channel-wise topology refinement graph convolution for skeleton-based action recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 13359–13368

  27. Chi H-g, Ha MH, Chi S, Lee SW, Huang Q, Ramani K (2022) Infogcn: representation learning for human skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20186–20196

  28. Yang D, Wang Y, Dantcheva A, Garattoni L, Francesca G, Bremond F (2021) Unik: a unified framework for real-world skeleton-based action recognition. arXiv:2107.08580

  29. Das S, Sharma S, Dai R, Bremond F, Thonnat M (2020) Vpn: learning video-pose embedding for activities of daily living. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part IX 16, pp 72–90. Springer

  30. Duan H, Zhao Y, Chen K, Lin D, Dai B (2022) Revisiting skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2969–2978

  31. Joze HRV, Shaban A, Iuzzolino ML, Koishida K (2020) Mmtm: multimodal transfer module for cnn fusion. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13289–13299

  32. Munro J, Damen D (2020) Multi-modal domain adaptation for fine-grained action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 122–132

  33. Gao R, Oh T-H, Grauman K, Torresani L (2020) Listen to look: action recognition by previewing audio. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10457–10467

  34. Alamri H, Cartillier V, Das A, Wang J, Cherian A, Essa I, Batra D, Marks TK, Hori C, Anderson P et al (2019) Audio visual scene-aware dialog. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7558–7567

  35. Goyal P, Sahu S, Ghosh S, Lee C (2020) Cross-modal learning for multi-modal video categorization. arXiv:2003.03501

  36. Zhang B, Wang L, Wang Z, Qiao Y, Wang H (2016) Real-time action recognition with enhanced motion vector cnns. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2718–2726

  37. Yang L, Huang Y, Sugano Y, Sato Y (2022) Interact before align: leveraging cross-modal knowledge for domain adaptive action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 14722–14732

  38. Alfasly S, Lu J, Xu C, Zou Y (2022) Learnable irrelevant modality dropout for multimodal action recognition on modality-specific annotated videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 20208–20217

  39. Shi Z, Liang J, Li Q, Zheng H, Gu Z, Dong J, Zheng B (2021) Multi-modal multi-action video recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 13678–13687

  40. Miech A, Laptev I, Sivic J (2018) Learning a text-video embedding from incomplete and heterogeneous data. arXiv:1804.02516

  41. Ijaz M, Diaz R, Chen C (2022) Multimodal transformer for nursing activity recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 2065–2074

  42. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022

  43. Jaderberg M, Simonyan K, Zisserman A et al (2015) Spatial transformer networks. Adv Neural Inf Process Syst 28

  44. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929

  45. Raghu M, Unterthiner T, Kornblith S, Zhang C, Dosovitskiy A (2021) Do vision transformers see like convolutional neural networks? Adv Neural Inf Process Syst 34:12116–12128


  46. Si C, Yu W, Zhou P, Zhou Y, Wang X, Yan S (2022) Inception transformer. arXiv:2205.12956

  47. Zhang W, Zhu M, Derpanis KG (2013) From actemes to action: a strongly-supervised representation for detailed action understanding. In: Proceedings of the IEEE international conference on computer vision, pp 2248–2255

  48. Shahroudy A, Liu J, Ng T-T, Wang G (2016) NTU RGB+D: a large scale dataset for 3d human activity analysis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1010–1019

  49. Liu J, Shahroudy A, Perez M, Wang G, Duan L-Y, Kot AC (2019) NTU RGB+D 120: a large-scale benchmark for 3d human activity understanding. IEEE Trans Pattern Anal Mach Intell 42(10):2684–2701


  50. Soomro K, Zamir AR, Shah M (2012) Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402

  51. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) Hmdb: a large video database for human motion recognition. In: 2011 international conference on computer vision, pp 2556–2563. IEEE

  52. Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P et al (2017) The kinetics human action video dataset. arXiv:1705.06950

  53. Guo T, Liu H, Chen Z, Liu M, Wang T, Ding R (2022) Contrastive learning from extremely augmented skeleton sequences for self-supervised action recognition. Proceedings of the AAAI conference on artificial intelligence 36:762–770

  54. Liu Y, Zhang H, Xu D, He K (2022) Graph transformer network with temporal kernel attention for skeleton-based action recognition. Knowl-Based Syst 240:108146


  55. Zeng A, Sun X, Yang L, Zhao N, Liu M, Xu Q (2021) Learning skeletal graph neural networks for hard 3d pose estimation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 11436–11445

  56. Liu M, Yuan J (2018) Recognizing human actions as the evolution of pose estimation maps. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1159–1168

  57. Chen T, Zhou D, Wang J, Wang S, Guan Y, He X, Ding E (2021) Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In: Proceedings of the 29th ACM international conference on multimedia, pp 4334–4342

  58. Bruce X, Liu Y, Chan KC (2021) Multimodal fusion via teacher-student network for indoor action recognition. Proceedings of the AAAI conference on artificial intelligence 35:3199–3207


  59. Cao C, Zhang Y, Zhang C, Lu H (2017) Body joint guided 3-d deep convolutional descriptors for action recognition. IEEE Trans Cybernet 48(3):1095–1108


  60. Luvizon DC, Picard D, Tabia H (2018) 2d/3d pose estimation and action recognition using multitask deep learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5137–5146

  61. Zhao R, Xu W, Su H, Ji Q (2019) Bayesian hierarchical dynamic model for human action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7733–7742

  62. Sun JJ, Zhao J, Chen L-C, Schroff F, Adam H, Liu T (2020) View-invariant probabilistic embedding for human pose. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part V 16, pp 53–70. Springer

  63. Hachiuma R, Sato F, Sekii T (2023) Unified keypoint-based action recognition framework via structured keypoint pooling. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 22962–22971

  64. Feichtenhofer C, Fan H, Malik J, He K (2019) Slowfast networks for video recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6202–6211

  65. Duan H, Zhao Y, Xiong Y, Liu W, Lin D (2020) Omni-sourced webly-supervised learning for video recognition. In: European conference on computer vision, pp 670–688. Springer

  66. Sun K, Xiao B, Liu D, Wang J (2019) Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5693–5703

  67. Bruce X, Liu Y, Zhang X, Zhong S-h, Chan KC (2022) Mmnet: a model-based multimodal network for human action recognition in rgb-d videos. IEEE Trans Pattern Anal Mach Intell


Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2022R1I1A3058128).

Funding

This study was funded by the Basic Science Research Program through the National Research Foundation of Korea (NRF), Ministry of Education (2022R1I1A3058128).

Author information


Contributions

D.A. and S.K. were responsible for the design and overall investigation. B.C.K. was responsible for data curation, supervision, and writing and editing of the manuscript.

Corresponding author

Correspondence to Byoung Chul Ko.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Ethical approval

Not applicable

Consent to participate

Not applicable

Consent for publication

Not applicable

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ahn, D., Kim, S. & Ko, B.C. STAR++: Rethinking spatio-temporal cross attention transformer for video action recognition. Appl Intell 53, 28446–28459 (2023). https://doi.org/10.1007/s10489-023-04978-7

