
Spatiotemporal Representation Enhanced ViT for Video Recognition

  • Conference paper
MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14554)


Abstract

Vision Transformers (ViTs) are promising for video-related tasks, but they often suffer from computational bottlenecks or insufficient temporal information. Recent advances in large-scale pre-training show great potential for high-quality video representation and offer new remedies for these limitations. Inspired by this, we propose the SpatioTemporal Representation Enhanced Vision Transformer (STRE-ViT), which follows a two-stream paradigm to fuse visual prior knowledge from large-scale pre-training with video-level temporal biases in a simple and effective manner. Specifically, one stream employs a well-pretrained ViT with rich vision priors to alleviate data requirements and learning workload. The other stream is our spatiotemporal interaction stream, which first models video-level temporal dynamics and then extracts fine-grained, salient spatiotemporal representations by introducing an appropriate temporal bias. Through this interaction stream, the capacity of the ViT for video spatiotemporal representation is enhanced. Moreover, we provide a fresh perspective on adapting a well-pretrained ViT for video recognition. Experimental results show that STRE-ViT learns high-quality video representations and achieves competitive performance on two popular video benchmarks.
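
The abstract describes a two-stream design: a frozen, well-pretrained ViT stream supplying spatial vision priors, and a lightweight spatiotemporal interaction stream that adds video-level temporal bias before the two are fused for classification. The sketch below is a minimal, hypothetical PyTorch illustration of that general idea, not the authors' implementation: the module names (FrozenViTStream, TemporalInteractionStream, STREViTSketch), the dimensions, and the fusion-by-averaging choice are all assumptions made only for illustration.

```python
# Hypothetical sketch of a two-stream video model in the spirit of STRE-ViT.
# Not the authors' code: module names, sizes, and fusion choices are assumptions.
import torch
import torch.nn as nn


class FrozenViTStream(nn.Module):
    """Stands in for a well-pretrained image ViT applied frame by frame.

    In practice this would be a large pretrained backbone with frozen weights;
    here a single transformer layer keeps the sketch runnable and self-contained.
    """

    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        for p in self.parameters():  # freeze: only the interaction stream would be trained
            p.requires_grad = False

    def forward(self, tokens):  # tokens: (B*T, N, C) patch tokens of individual frames
        return self.encoder(tokens)


class TemporalInteractionStream(nn.Module):
    """Lightweight stream that injects video-level temporal bias via attention over time."""

    def __init__(self, dim=256):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, T, C), one summary token per frame
        attn_out, _ = self.temporal_attn(x, x, x)
        return self.norm(x + attn_out)  # residual temporal interaction


class STREViTSketch(nn.Module):
    """Two-stream fusion: frozen spatial ViT features plus learned temporal interaction."""

    def __init__(self, dim=256, num_classes=400):
        super().__init__()
        self.spatial = FrozenViTStream(dim)
        self.temporal = TemporalInteractionStream(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens, batch_size, num_frames):
        # tokens: (B*T, N, C) patch tokens for T frames of each of B clips
        per_frame = self.spatial(tokens).mean(dim=1)            # (B*T, C) frame summaries
        frames = per_frame.reshape(batch_size, num_frames, -1)  # (B, T, C)
        video_feat = self.temporal(frames).mean(dim=1)          # (B, C) fused video feature
        return self.head(video_feat)


if __name__ == "__main__":
    B, T, N, C = 2, 8, 196, 256
    model = STREViTSketch(dim=C, num_classes=400)
    logits = model(torch.randn(B * T, N, C), B, T)
    print(logits.shape)  # torch.Size([2, 400])
```

Freezing the spatial stream mirrors the paper's stated goal of reusing vision priors from large-scale pre-training to reduce data requirements, while only the small temporal interaction stream and the classifier head would be trained.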




Acknowledgements

This work was supported by the National Key Research and Development Program of China under Grant 2021YFB2910109.

Author information


Corresponding author

Correspondence to Min Li.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, M. et al. (2024). Spatiotemporal Representation Enhanced ViT for Video Recognition. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_3


  • DOI: https://doi.org/10.1007/978-3-031-53305-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53304-4

  • Online ISBN: 978-3-031-53305-1

  • eBook Packages: Computer Science, Computer Science (R0)
