
Semantic-guided spatio-temporal attention for few-shot action recognition


Abstract

Few-shot action recognition is a challenging problem that aims to learn a model capable of adapting to recognize new categories from only a few labeled videos. Recently, some works have used attention mechanisms to focus on relevant regions and obtain discriminative representations. Despite this significant progress, such methods still fall short of outstanding performance because examples are insufficient and supplementary information is scarce. In this paper, we propose a novel Semantic-guided Spatio-temporal Attention (SGSTA) approach for few-shot action recognition. The main idea of SGSTA is to exploit the semantic information contained in the text embeddings of labels to guide attention so that it captures the rich spatio-temporal context in videos more accurately when visual content is insufficient. Specifically, SGSTA comprises two essential components: a visual-text alignment module and a semantic-guided spatio-temporal attention module. The former aligns visual features and text embeddings to eliminate the semantic gap between them. The latter is further divided into spatial attention and temporal attention. First, semantic-guided spatial attention is applied to the frame feature map to focus on semantically relevant spatial regions. Then, semantic-guided temporal attention encodes the semantically enhanced temporal context with a temporal Transformer. Finally, the resulting spatio-temporal contextual representation is used to learn relationship matching between support and query sequences. In this way, SGSTA fully utilizes the rich semantic priors in label embeddings to improve class-specific discriminability and achieve accurate few-shot recognition. Comprehensive experiments on four challenging benchmarks demonstrate that the proposed SGSTA is effective and achieves competitive performance against existing state-of-the-art methods under various settings.
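
To make the pipeline described above concrete, the following is a minimal PyTorch-style sketch of semantic-guided spatio-temporal attention. It is an illustrative sketch under stated assumptions rather than the authors' implementation: the module name, the feature dimensions (ResNet-50 region features, 300-d GloVe label embeddings), the shared projection, and the single-layer temporal Transformer are all assumptions introduced for illustration.

import torch
import torch.nn as nn


class SemanticGuidedAttention(nn.Module):
    # Hypothetical sketch: a label text embedding guides (1) spatial attention
    # over each frame's region features and (2) a temporal Transformer encoder.
    def __init__(self, visual_dim=2048, text_dim=300, hidden_dim=512, num_heads=4):
        super().__init__()
        # Visual-text alignment: project both modalities into a shared space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Temporal Transformer over the semantically weighted frame features.
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, frame_maps, text_emb):
        # frame_maps: (T, HW, visual_dim) region features of T sampled frames
        # text_emb:   (text_dim,) embedding of the class label (e.g. GloVe)
        v = self.visual_proj(frame_maps)                  # (T, HW, hidden)
        t = self.text_proj(text_emb)                      # (hidden,)
        # Spatial attention: weight regions by their similarity to the label.
        attn = torch.softmax(v @ t / v.shape[-1] ** 0.5, dim=-1)   # (T, HW)
        frame_feats = (attn.unsqueeze(-1) * v).sum(dim=1)          # (T, hidden)
        # Temporal attention: encode context across frames with the Transformer.
        out = self.temporal_encoder(frame_feats.unsqueeze(0))      # (1, T, hidden)
        return out.squeeze(0)                                      # (T, hidden)


# Example: 8 frames, a 7x7 ResNet-50 feature map per frame, 300-d label embedding.
model = SemanticGuidedAttention()
video_regions = torch.randn(8, 49, 2048)
label_embedding = torch.randn(300)
print(model(video_regions, label_embedding).shape)   # torch.Size([8, 512])

In this sketch, the label embedding first scores every spatial region of each frame (spatial attention), the weighted regions are pooled into per-frame descriptors, and the Transformer encoder then contextualizes those descriptors across time (temporal attention), mirroring the two-stage design outlined in the abstract.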


Data availability

All data used during this study are public. The SSv2 dataset is available at https://developer.qualcomm.com/software/ai-datasets/something-something. The Kinetics dataset is available at https://www.deepmind.com/open-source/kinetics. The HMDB51 dataset is available at https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/. The few-shot splits of these datasets are also publicly available and are collected at https://github.com/tobyperrett/trx/tree/main/splits.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. U2133218), the National Key Research and Development Program of China (No. 2018YFB0204304), and the Fundamental Research Funds for the Central Universities of China (Nos. FRF-MP-19-007 and FRF-TP-20-065A1Z).

Author information


Contributions

Jianyu Wang: Conceptualization; Methodology; Data processing; Experiments; Writing - original draft. Baolin Liu: Supervision; Validation; Visualization; Writing - review and editing.

Corresponding author

Correspondence to Baolin Liu.

Ethics declarations

Ethical and informed consent for data used

The data used in this study were legally obtained from public datasets and acquired with proper permissions and authorizations, ensuring compliance with ethical standards.

Competing interests

The authors declare no potential competing interests, including financial, non-financial, or other associations with individuals or organizations that could improperly influence this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, J., Liu, B. Semantic-guided spatio-temporal attention for few-shot action recognition. Appl Intell 54, 2458–2471 (2024). https://doi.org/10.1007/s10489-024-05294-4


