
Semantic-guided spatio-temporal attention for few-shot action recognition


Abstract

Few-shot action recognition is a challenging problem that aims to learn a model capable of adapting to recognize new categories from only a few labeled videos. Recently, some works have used attention mechanisms to focus on relevant regions and obtain discriminative representations. Despite this significant progress, such methods still fall short of outstanding performance because examples are insufficient and supplementary information is scarce. In this paper, we propose a novel Semantic-guided Spatio-temporal Attention (SGSTA) approach for few-shot action recognition. The main idea of SGSTA is to exploit the semantic information contained in the text embeddings of labels to guide attention so that it captures the rich spatio-temporal context in videos more accurately when visual content is insufficient. Specifically, SGSTA comprises two essential components: a visual-text alignment module and a semantic-guided spatio-temporal attention module. The former aligns visual features and text embeddings to eliminate the semantic gap between them. The latter is further divided into spatial attention and temporal attention. First, semantic-guided spatial attention is applied to the frame feature map to focus on semantically relevant spatial regions. Then, semantic-guided temporal attention encodes the semantically enhanced temporal context with a temporal Transformer. Finally, the resulting spatio-temporal contextual representation is used to learn relationship matching between support and query sequences. In this way, SGSTA fully utilizes the rich semantic priors in label embeddings to improve class-specific discriminability and achieve accurate few-shot recognition. Comprehensive experiments on four challenging benchmarks demonstrate that the proposed SGSTA is effective and achieves competitive performance against existing state-of-the-art methods under various settings.
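
To make the pipeline described above concrete, the following is a minimal PyTorch-style sketch of semantic-guided spatio-temporal attention. It is an illustrative sketch under stated assumptions rather than the authors' implementation: the module name, the feature dimensions (ResNet-50 region features, 300-d GloVe label embeddings), the shared projection, and the single-layer temporal Transformer are all assumptions introduced for illustration.

import torch
import torch.nn as nn


class SemanticGuidedAttention(nn.Module):
    # Hypothetical sketch: a label text embedding guides (1) spatial attention
    # over each frame's region features and (2) a temporal Transformer encoder.
    def __init__(self, visual_dim=2048, text_dim=300, hidden_dim=512, num_heads=4):
        super().__init__()
        # Visual-text alignment: project both modalities into a shared space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Temporal Transformer over the semantically weighted frame features.
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, frame_maps, text_emb):
        # frame_maps: (T, HW, visual_dim) region features of T sampled frames
        # text_emb:   (text_dim,) embedding of the class label (e.g. GloVe)
        v = self.visual_proj(frame_maps)                  # (T, HW, hidden)
        t = self.text_proj(text_emb)                      # (hidden,)
        # Spatial attention: weight regions by their similarity to the label.
        attn = torch.softmax(v @ t / v.shape[-1] ** 0.5, dim=-1)   # (T, HW)
        frame_feats = (attn.unsqueeze(-1) * v).sum(dim=1)          # (T, hidden)
        # Temporal attention: encode context across frames with the Transformer.
        out = self.temporal_encoder(frame_feats.unsqueeze(0))      # (1, T, hidden)
        return out.squeeze(0)                                      # (T, hidden)


# Example: 8 frames, a 7x7 ResNet-50 feature map per frame, 300-d label embedding.
model = SemanticGuidedAttention()
video_regions = torch.randn(8, 49, 2048)
label_embedding = torch.randn(300)
print(model(video_regions, label_embedding).shape)   # torch.Size([8, 512])

In this sketch, the label embedding first scores every spatial region of each frame (spatial attention), the weighted regions are pooled into per-frame descriptors, and the Transformer encoder then contextualizes those descriptors across time (temporal attention), mirroring the two-stage design outlined in the abstract.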


Data availability

All data used during this study are public. The SSv2 dataset is available at https://developer.qualcomm.com/software/ai-datasets/something-something. The Kinetics dataset is available at https://www.deepmind.com/open-source/kinetics. The HMDB51 dataset is available at https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/. The few-shot splits of these datasets are also publicly available and are collected at https://github.com/tobyperrett/trx/tree/main/splits.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. U2133218), the National Key Research and Development Program of China (No. 2018YFB0204304), and the Fundamental Research Funds for the Central Universities of China (Nos. FRF-MP-19-007 and FRF-TP-20-065A1Z).

Author information


Contributions

Jianyu Wang: Conceptualization; Methodology; Data processing; Experiments; Writing - original draft. Baolin Liu: Supervision; Validation; Visualization; Writing - review and editing.

Corresponding author

Correspondence to Baolin Liu.

Ethics declarations

Ethical and informed consent for data used

The data used in this study were legally obtained from public datasets and acquired with proper permissions and authorizations, ensuring compliance with ethical standards.

Competing interests

The authors declare no potential competing interests, including financial, non-financial, or other associations with individuals or organizations that could improperly influence this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, J., Liu, B. Semantic-guided spatio-temporal attention for few-shot action recognition. Appl Intell 54, 2458–2471 (2024). https://doi.org/10.1007/s10489-024-05294-4


