ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

  • Conference paper
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks. Due to the huge number of parameters and the strong transferability of image models, performing full fine-tuning is less efficient and often unnecessary. Thus, recent research has shifted its focus toward parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational costs to deal with the domain gap and temporal modeling in videos. In this paper, we present a new adaptation paradigm (ZeroI2V) that transfers image transformers to video recognition tasks while introducing zero extra cost to the original models during inference. To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA). This approach efficiently endows image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that utilizes lightweight, densely placed linear adapters to fully transfer the frozen image models to video recognition. Thanks to the customized linear design, all newly added adapters can be easily merged with the original modules through structural reparameterization after training, enabling zero extra cost during inference. Extensive experiments on representative fully-supervised and few-shot video recognition benchmarks show that ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
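
The NumPy sketch below is not the authors' implementation; it only illustrates, under stated assumptions, the two mechanisms the abstract describes. The head split, temporal offset, adapter rank, and adapter placement used here are illustrative choices and may differ from the paper: (1) an STDHA-style attention in which a subset of heads borrows keys/values from a neighboring frame via a temporal shift, adding no parameters or FLOPs; (2) a purely linear adapter placed in parallel with a frozen linear layer that is folded back into that layer after training (structural reparameterization), so the inference-time model is unchanged.

```python
# Illustrative sketch only (not the authors' code); shapes and placements are assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, H, N, Dh = 4, 6, 8, 16          # frames, heads, tokens per frame, head dim
D = H * Dh                          # model dim


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def stdha(q, k, v, temporal_heads=2, offset=1):
    """Dual-headed attention sketch: the first `temporal_heads` heads attend to
    keys/values rolled in from frame t - offset (circular over time); the
    remaining heads perform ordinary spatial attention.
    Shapes: (T, H, N, Dh). The head split and offset are assumptions here."""
    k, v = k.copy(), v.copy()
    k[:, :temporal_heads] = np.roll(k[:, :temporal_heads], offset, axis=0)
    v[:, :temporal_heads] = np.roll(v[:, :temporal_heads], offset, axis=0)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(Dh))
    return attn @ v                 # same cost as ordinary self-attention


# (2) Merge a purely linear adapter into the frozen layer after training.
W = rng.standard_normal((D, D)) * 0.02      # frozen pre-trained weight
b = rng.standard_normal(D) * 0.02
down = rng.standard_normal((8, D)) * 0.02   # low-rank linear adapter, D -> 8
up = rng.standard_normal((D, 8)) * 0.02     # 8 -> D, no non-linearity in between

x = rng.standard_normal((N, D))
y_train = x @ W.T + b + x @ (up @ down).T   # training: frozen path + adapter path
W_merged = W + up @ down                    # structural reparameterization
y_infer = x @ W_merged.T + b                # inference: a single linear layer
assert np.allclose(y_train, y_infer)

# Quick STDHA call on random projections, just to show the shapes.
q = rng.standard_normal((T, H, N, Dh))
k = rng.standard_normal((T, H, N, Dh))
v = rng.standard_normal((T, H, N, Dh))
print(stdha(q, k, v).shape, np.abs(y_train - y_infer).max())
```

Because the adapter branch in this sketch is purely linear (no activation between the down- and up-projections), it collapses exactly into the frozen weight, which is what makes the zero extra inference cost possible.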

Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2022ZD0160900), the National Natural Science Foundation of China (No. 62076119, No. 61921006), the Fundamental Research Funds for the Central Universities (No. 020214380119), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Corresponding author

Correspondence to Limin Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 3356 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, X., Zhu, Y., Wang, L. (2025). ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15141. Springer, Cham. https://doi.org/10.1007/978-3-031-73010-8_25

  • DOI: https://doi.org/10.1007/978-3-031-73010-8_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73009-2

  • Online ISBN: 978-3-031-73010-8

  • eBook Packages: Computer Science, Computer Science (R0)
