Abstract
Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks. Because image models have a huge number of parameters yet transfer effectively, full fine-tuning is inefficient and often unnecessary, so recent research has shifted toward parameter-efficient image-to-video adaptation. However, existing adaptation strategies inevitably introduce extra computational cost to deal with the domain gap and the temporal modeling required for videos. In this paper, we present a new adaptation paradigm, ZeroI2V, which transfers image transformers to video recognition tasks while introducing zero extra cost to the original models during inference. To achieve this goal, we present two core designs. First, to capture the dynamics in videos and ease image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA), which endows image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that uses lightweight, densely placed linear adapters to fully transfer the frozen image models to video recognition. Thanks to this customized linear design, all newly added adapters can be merged into the original modules through structural reparameterization after training, enabling zero extra cost during inference. Extensive experiments on representative fully supervised and few-shot video recognition benchmarks show that ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
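To make the two designs concrete, the following PyTorch-style sketch illustrates the general ideas under stated assumptions; it is not the authors' released implementation. The head split, temporal offset, tensor shapes, and the low-rank parallel form of the adapter are illustrative choices, not details confirmed by the abstract. The first function captures the STDHA intuition: a subset of attention heads borrows keys and values rolled from neighboring frames, adding temporal reasoning with no new parameters. The second class shows why a purely linear adapter placed around a frozen linear layer can be folded back into that layer's weights after training, leaving inference cost unchanged.

```python
import torch
import torch.nn as nn


def stdha_shift(k, v, num_temporal_heads=4, offset=1, num_frames=8):
    """Sketch of the STDHA idea: some heads attend to a neighboring frame.

    k, v: (B*T, H, N, D) per-frame keys/values inside a frozen image ViT
    block, with B videos, T = num_frames, H heads, N tokens, D head dim.
    The first `num_temporal_heads` heads receive keys/values rolled along
    the temporal axis; the remaining heads keep the original spatial ones.
    No parameters or extra attention computation are introduced.
    """
    BT, H, N, D = k.shape
    B = BT // num_frames
    k = k.reshape(B, num_frames, H, N, D)
    v = v.reshape(B, num_frames, H, N, D)
    # Roll only the chosen heads along the time dimension (dim=1).
    k_t = torch.roll(k[:, :, :num_temporal_heads], shifts=offset, dims=1)
    v_t = torch.roll(v[:, :, :num_temporal_heads], shifts=offset, dims=1)
    k = torch.cat([k_t, k[:, :, num_temporal_heads:]], dim=2)
    v = torch.cat([v_t, v[:, :, num_temporal_heads:]], dim=2)
    return k.reshape(BT, H, N, D), v.reshape(BT, H, N, D)


class LinearAdapter(nn.Module):
    """A parallel low-rank linear adapter around a frozen linear layer."""

    def __init__(self, frozen: nn.Linear, rank: int = 8):
        super().__init__()
        self.frozen = frozen
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(frozen.in_features, rank, bias=False)
        self.up = nn.Linear(rank, frozen.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter initially contributes nothing

    def forward(self, x):
        return self.frozen(x) + self.up(self.down(x))

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the adapter into the frozen weights (structural reparameterization)."""
        merged = nn.Linear(self.frozen.in_features,
                           self.frozen.out_features,
                           bias=self.frozen.bias is not None)
        # y = x W_f^T + b + x (W_up W_down)^T  =>  W_m = W_f + W_up @ W_down
        merged.weight.copy_(self.frozen.weight + self.up.weight @ self.down.weight)
        if self.frozen.bias is not None:
            merged.bias.copy_(self.frozen.bias)
        return merged
```

After training only the adapter parameters, calling `merge()` on each adapter yields plain linear layers, so the adapted model runs with exactly the same architecture and cost as the original image transformer; this merging step is the reason the adapters in the paper are kept strictly linear.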
Acknowledgements
This work is supported by the National Key R&D Program of China (No. 2022ZD0160900), the National Natural Science Foundation of China (No. 62076119, No. 61921006), the Fundamental Research Funds for the Central Universities (No. 020214380119), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, X., Zhu, Y., Wang, L. (2025). ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15141. Springer, Cham. https://doi.org/10.1007/978-3-031-73010-8_25
DOI: https://doi.org/10.1007/978-3-031-73010-8_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73009-2
Online ISBN: 978-3-031-73010-8