Abstract
Few-shot action recognition aims to classify unseen action classes from only a few labeled training samples. Most current works follow the metric-learning paradigm, learning a good embedding together with an appropriate comparison metric. However, with so little labeled data, embedding networks trained through the episodic meta-learning process generalize poorly. In this paper, we repurpose self-supervised learning to obtain a more generalizable few-shot embedding model. Specifically, we propose a Spatio-Temporal Self-supervision (STS) framework for few-shot action recognition that generates self-supervision losses at both the spatial and temporal levels and uses them as auxiliary losses. In this way, STS provides a robust representation for few-shot action recognition. Furthermore, we propose a Spatio-Temporal Aggregation (STA) module that models the spatial relationships among all frames within a video sequence to produce a better video embedding. Experiments on several challenging few-shot action recognition benchmarks demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance.
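To make the training recipe concrete, below is a minimal PyTorch sketch of how spatial and temporal self-supervision signals might be attached to a few-shot backbone as auxiliary heads, and how an attention-based aggregator could pool frame features into a video embedding. The specific pretext tasks chosen here (per-frame rotation prediction and ordered-vs-shuffled clip classification), the attention design, and all names, shapes, and loss weights are illustrative assumptions, not the paper's actual STS/STA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class STSAuxiliaryHeads(nn.Module):
    """Illustrative self-supervision heads (assumed, not from the paper):
    a 4-way rotation classifier as the spatial pretext task and a binary
    ordered-vs-shuffled classifier as the temporal pretext task."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.spatial_head = nn.Linear(feat_dim, 4)   # 0/90/180/270 degree rotations
        self.temporal_head = nn.Linear(feat_dim, 2)  # ordered vs. shuffled clip

    def forward(self, frame_feats, clip_feats, rot_labels, order_labels):
        # frame_feats: (B*T, D) per-frame backbone features
        # clip_feats:  (B, D)  clip-level features after temporal pooling
        loss_spatial = F.cross_entropy(self.spatial_head(frame_feats), rot_labels)
        loss_temporal = F.cross_entropy(self.temporal_head(clip_feats), order_labels)
        return loss_spatial, loss_temporal


class STAAggregation(nn.Module):
    """Stand-in for the STA module: self-attention across the T frames of a
    video so every frame can attend to every other frame, followed by mean
    pooling into one video embedding. feat_dim must divide by num_heads."""

    def __init__(self, feat_dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D)
        ctx, _ = self.attn(frame_feats, frame_feats, frame_feats)
        return ctx.mean(dim=1)  # (B, D) video-level embedding


def total_loss(loss_fsl, loss_spatial, loss_temporal, lam_s=0.5, lam_t=0.5):
    # Few-shot metric/classification loss plus weighted auxiliary losses;
    # lam_s and lam_t are hypothetical trade-off hyperparameters.
    return loss_fsl + lam_s * loss_spatial + lam_t * loss_temporal
```

In an episodic training loop, one would compute the few-shot metric loss on the STA-aggregated support and query embeddings and add the two weighted self-supervision losses via total_loss, so that the auxiliary tasks regularize the embedding without altering the comparison metric at test time.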
Acknowledgements
This work was supported by the National Key Research and Development Program of China under Grant 2022ZD0160402, by the National Natural Science Foundation of China under Grants U21A20514, 62176195, 62372388, and 62071404, and by the FuXiaQuan National Independent Innovation Demonstration Zone Collaborative Innovation Platform Project under Grant 3502ZCQXT2022008.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yu, W., Guo, H., Yan, Y., Li, J., Wang, H. (2024). Spatio-Temporal Self-supervision for Few-Shot Action Recognition. In: Liu, Q., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol. 14425. Springer, Singapore. https://doi.org/10.1007/978-981-99-8429-9_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8428-2
Online ISBN: 978-981-99-8429-9