Efficient Few-Shot Action Recognition via Multi-level Post-reasoning

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

The integration of CLIP (Contrastive Language-Image Pre-training) has significantly refreshed the accuracy leaderboard of FSAR (Few-Shot Action Recognition). However, the training overhead of aligning the domains of CLIP and FSAR is often prohibitive. To mitigate this issue, we present an Efficient Multi-Level Post-Reasoning Network, namely EMP-Net. By design, a post-reasoning mechanism is proposed for domain adaptation; it avoids most gradient backpropagation and thereby improves efficiency. Meanwhile, a multi-level representation is utilised during the reasoning and matching processes to improve discriminability, ensuring effectiveness. Specifically, the proposed EMP-Net starts with a skip-fusion of cached multi-stage features extracted by CLIP. The fused feature is then decoupled into multi-level representations at the global, patch, and frame levels. The ensuing spatiotemporal reasoning module operates on these multi-level representations to generate discriminative features. For matching, text-visual and support-query contrasts are integrated to provide comprehensive guidance. Experimental results demonstrate that EMP-Net unlocks the potential of CLIP in a more efficient manner. The code and supplementary material can be found at https://github.com/cong-wu/EMP-Net.
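
To make the pipeline described in the abstract concrete, below is a minimal PyTorch-style sketch: cached multi-stage CLIP features are fused with learned stage weights, decoupled into global-, frame-, and patch-level representations, passed through a lightweight reasoning layer, and a query is scored against class prototypes by cosine similarity. Everything here is an illustrative assumption, not the authors' implementation: the class and function names (MultiLevelPostReasoning, support_query_scores), the tensor layout, and the single transformer layer used for reasoning are all hypothetical; the actual method is in the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelPostReasoning(nn.Module):
    """Illustrative sketch only: fuse cached multi-stage CLIP features with
    learned stage weights, decouple them into global/frame/patch levels, and
    apply a lightweight reasoning layer over the frame-level tokens."""

    def __init__(self, dim: int = 512, num_stages: int = 4):
        super().__init__()
        self.stage_weights = nn.Parameter(torch.ones(num_stages))  # skip-fusion weights
        self.reason = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, feats: torch.Tensor):
        # feats: [B, S, T, P, D] -- batch, cached stages, frames, tokens (CLS + patches), dim
        w = torch.softmax(self.stage_weights, dim=0)
        fused = (feats * w.view(1, -1, 1, 1, 1)).sum(dim=1)        # [B, T, P, D]

        cls_tok, patch_tok = fused[:, :, 0], fused[:, :, 1:]       # split CLS / patch tokens
        global_repr = cls_tok.mean(dim=1)                          # [B, D]          global level
        frame_repr = self.reason(cls_tok)                          # [B, T, D]       frame level
        patch_repr = patch_tok.flatten(1, 2)                       # [B, T*(P-1), D] patch level
        return global_repr, frame_repr, patch_repr


def support_query_scores(query_frames: torch.Tensor, support_protos: torch.Tensor) -> torch.Tensor:
    """Toy support-query matching: mean per-frame cosine similarity between one
    query video and per-class frame prototypes (the text-visual contrast is omitted)."""
    q = F.normalize(query_frames, dim=-1)                          # [T, D]
    s = F.normalize(support_protos, dim=-1)                        # [C, T, D]
    per_frame = torch.einsum('td,ctd->ct', q, s)                   # [C, T]
    return per_frame.mean(dim=-1)                                  # [C] class scores
```

In this toy version, a query video's frame-level output would be scored against each class's averaged support representation with support_query_scores; in the paper this support-query contrast is combined with a text-visual contrast derived from the CLIP text encoder.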

Acknowledgements

This work is supported in part by the National Key Research and Development Program of China (2023YFF1105102, 2023YFF1105105), the National Natural Science Foundation of China (62020106012, 62332008, 62106089, U1836218, 62336004), the 111 Project of Ministry of Education of China (B12018), and the UK EPSRC (EP/V002856/1, EP/T022205/1).

Author information

Corresponding author

Correspondence to Xiao-Jun Wu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 422 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wu, C., Wu, XJ., Li, L., Xu, T., Feng, Z., Kittler, J. (2025). Efficient Few-Shot Action Recognition via Multi-level Post-reasoning. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15061. Springer, Cham. https://doi.org/10.1007/978-3-031-72646-0_3

  • DOI: https://doi.org/10.1007/978-3-031-72646-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72645-3

  • Online ISBN: 978-3-031-72646-0

  • eBook Packages: Computer Science, Computer Science (R0)
