Space or time for video classification transformers

Abstract

Spatial and temporal attention play an important role in video classification tasks, yet few studies have examined the mechanism of spatial and temporal attention behind classification problems. Thanks to its self-attention mechanism, the Transformer offers excellent training scalability and captures long-range dependencies in sequences, which has led to great success in many fields, especially video classification. Through Divided Space-Time Attention, spatio-temporal attention is separated into a temporal attention module and a spatial attention module, which makes it more convenient to configure the attention modules and adjust how they interact. Single-stream and two-stream models are then designed to study how information is exchanged between spatial attention and temporal attention, using a series of carefully designed experiments. The experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream structure deserves consideration in some cases, as it can achieve better results than the popular single-stream structure.
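
To make the Divided Space-Time Attention described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes a TimeSformer-style token layout in which each clip is tokenised into T frames of S patch embeddings, applies temporal attention across frames at each spatial position and then spatial attention within each frame, and omits the classification token, MLP sub-layers, and the single-stream/two-stream interaction schemes studied in the paper. The class name DividedSpaceTimeBlock and all shapes are illustrative.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """One transformer block with temporal and spatial attention applied separately."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds patch tokens of shape (batch B, frames T, patches S, dim D).
        b, t, s, d = x.shape

        # Temporal attention: tokens at the same spatial position attend across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        nt = self.norm_t(xt)
        xt = xt + self.temporal_attn(nt, nt, nt, need_weights=False)[0]
        x = xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

        # Spatial attention: tokens within the same frame attend to each other.
        xs = x.reshape(b * t, s, d)
        ns = self.norm_s(xs)
        xs = xs + self.spatial_attn(ns, ns, ns, need_weights=False)[0]
        return xs.reshape(b, t, s, d)


# Toy usage: 2 clips, 8 frames, 14x14 = 196 patch tokens, 768-dim embeddings.
tokens = torch.randn(2, 8, 196, 768)
block = DividedSpaceTimeBlock(dim=768, heads=12)
out = block(tokens)  # shape preserved: (2, 8, 196, 768)
```

Swapping the order of the two attention calls, weighting one module over the other, or routing them into parallel streams is exactly the kind of configuration choice that the paper's experiments compare.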

Data Availability

The data used in this paper are all from public datasets.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 62172267), the Natural Science Foundation of Shanghai, China (Grant No. 20ZR1420400), the State Key Program of the National Natural Science Foundation of China (Grant No. 61936001), the Shanghai Pujiang Program (Grant No. 21PJ1404200), and the Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02).

Author information

Corresponding author

Correspondence to Xing Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest with any person or institution regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, X., Tao, C., Zhang, J. et al. Space or time for video classification transformers. Appl Intell 53, 23039–23048 (2023). https://doi.org/10.1007/s10489-023-04756-5
