Abstract
Vision Transformers (ViTs) are promising for video-related tasks, but they often suffer from computational bottlenecks or insufficient temporal modeling. Recent advances in large-scale pre-training show great potential for high-quality video representation, offering new remedies for these limitations. Inspired by this, we propose a SpatioTemporal Representation Enhanced Vision Transformer (STRE-ViT), which follows a two-stream paradigm to fuse the visual prior knowledge of large-scale pre-training with video-level temporal biases in a simple and effective manner. Specifically, one stream employs a well-pretrained ViT with rich vision priors to alleviate data requirements and learning workload. The other is our proposed spatiotemporal interaction stream, which first models video-level temporal dynamics and then extracts fine-grained, salient spatiotemporal representations by introducing an appropriate temporal bias. Through this interaction stream, the capacity of the ViT for spatiotemporal video representation is enhanced. Moreover, we provide a fresh perspective on adapting well-pretrained ViTs to video recognition. Experimental results show that STRE-ViT learns high-quality video representations and achieves competitive performance on two popular video benchmarks.
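The abstract only outlines the two-stream design, so the following minimal PyTorch sketch illustrates one plausible reading of it: a frozen, image-pretrained ViT stream providing per-frame spatial features, plus a lightweight temporal-interaction stream that injects video-level dynamics, with the two fused before classification. All module names (FrozenViTStream, TemporalInteractionStream, STREViTSketch), the additive late fusion, and the class count are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a two-stream video model in the spirit of STRE-ViT.
# The modules below are illustrative stand-ins, not the authors' architecture.
import torch
import torch.nn as nn


class FrozenViTStream(nn.Module):
    """Per-frame spatial tokens from a (frozen) image-pretrained backbone.

    A single patch embedding stands in for a real pretrained ViT
    (e.g. a CLIP visual encoder), which would be plugged in here.
    """

    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():           # keep the visual prior fixed
            p.requires_grad = False

    def forward(self, video):                 # video: (B, T, 3, H, W)
        b, t, _, _, _ = video.shape
        x = self.patch_embed(video.flatten(0, 1))        # (B*T, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)                  # (B*T, N, D)
        return x.reshape(b, t, x.shape[1], x.shape[2])    # (B, T, N, D)


class TemporalInteractionStream(nn.Module):
    """Models video-level temporal dynamics over pooled per-frame features."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                # tokens: (B, T, N, D)
        frame_feat = tokens.mean(dim=2)       # pool patch tokens -> (B, T, D)
        attn_out, _ = self.temporal_attn(frame_feat, frame_feat, frame_feat)
        return self.norm(frame_feat + attn_out)           # (B, T, D)


class STREViTSketch(nn.Module):
    def __init__(self, dim=768, num_classes=400):         # class count is a placeholder
        super().__init__()
        self.spatial = FrozenViTStream(dim)
        self.temporal = TemporalInteractionStream(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                 # (B, T, 3, H, W)
        tokens = self.spatial(video)
        spatial_feat = tokens.mean(dim=(1, 2))             # global spatial pooling
        temporal_feat = self.temporal(tokens).mean(dim=1)  # temporal pooling
        return self.head(spatial_feat + temporal_feat)     # simple additive fusion


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 224, 224)      # two clips of eight frames
    logits = STREViTSketch()(clip)
    print(logits.shape)                         # torch.Size([2, 400])
```

Freezing the spatial stream mirrors the paper's idea of reusing pre-trained visual priors to cut data requirements, while only the interaction stream and head would be trained; how the real model fuses the streams and injects temporal bias may differ from this additive sketch.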
Acknowledgements
This work was supported by the National Key Research and Development Program of China under Grant 2021YFB2910109.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, M. et al. (2024). Spatiotemporal Representation Enhanced ViT for Video Recognition. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_3
DOI: https://doi.org/10.1007/978-3-031-53305-1_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53304-4
Online ISBN: 978-3-031-53305-1
eBook Packages: Computer Science, Computer Science (R0)