Abstract
Contrastive language-image pretraining has shown great success in learning joint visual-textual representations from web-scale data, demonstrating remarkable “zero-shot” generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. This module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information to generate discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and generalizes to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12× fewer FLOPs than Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when labeled data is extremely limited. Code and models are publicly available.
B. Ni and M. Chen—Work done during internship at Microsoft Research.
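To make the two ideas described in the abstract concrete, the following minimal PyTorch sketch illustrates (i) a lightweight cross-frame attention block in which per-frame message tokens exchange information along the temporal axis, and (ii) a video-specific prompt module that conditions class-name text embeddings on pooled video features. The module names, dimensions, and the exact way messages are mixed back into frame tokens are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: shapes, names, and wiring are assumptions,
# not the exact implementation described in the paper.
import torch
import torch.nn as nn


class CrossFrameAttention(nn.Module):
    """Exchange information across frames via per-frame message tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.to_message = nn.Linear(dim, dim)  # frame token -> message token
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames, patches, dim); token 0 plays a [CLS]-like role.
        messages = self.to_message(frame_tokens[:, :, 0])             # (b, t, d)
        mixed, _ = self.temporal_attn(messages, messages, messages)   # attend over frames
        messages = self.norm(messages + mixed)
        # Write the temporally mixed message back into each frame's leading token.
        frame_tokens = frame_tokens.clone()
        frame_tokens[:, :, 0] = frame_tokens[:, :, 0] + messages
        return frame_tokens


class VideoSpecificPrompt(nn.Module):
    """Condition text (class-name) embeddings on video content."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_emb: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, classes, dim); video_tokens: (batch, frames, dim)
        prompt, _ = self.attn(text_emb, video_tokens, video_tokens)   # cross-attention
        return self.norm(text_emb + prompt)


if __name__ == "__main__":
    frames = torch.randn(2, 8, 197, 512)        # 8 frames of ViT patch tokens
    texts = torch.randn(2, 400, 512)            # e.g. 400 Kinetics class embeddings
    frames = CrossFrameAttention(512)(frames)
    video_feat = frames[:, :, 0]                # per-frame [CLS]-like features
    texts = VideoSpecificPrompt(512)(texts, video_feat)
    print(frames.shape, texts.shape)
```

Because both blocks operate only on a handful of message or prompt tokens per frame, they add little compute on top of the frozen-or-finetuned image backbone, which is consistent with the abstract's claim that the cross-frame module is lightweight and pluggable.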
Acknowledgements
This research was supported in part by the National Key Research and Development Program of China under Grant No. 2018AAA0100400, the National Natural Science Foundation of China under Grants 61976208, 62071466, and 62076242, and the InnoHK project. HL received no funding for this research.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ni, B. et al. (2022). Expanding Language-Image Pretrained Models for General Video Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13664. Springer, Cham. https://doi.org/10.1007/978-3-031-19772-7_1
DOI: https://doi.org/10.1007/978-3-031-19772-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19771-0
Online ISBN: 978-3-031-19772-7
eBook Packages: Computer Science, Computer Science (R0)