Abstract
Spatial and temporal attention play an important role in video classification, yet few studies examine the mechanism behind spatio-temporal attention in classification problems. Thanks to its self-attention mechanism, the Transformer scales well in training and excels at capturing long-range dependencies among sequences, and it has achieved great success in many fields, especially video classification. With Divided Space-Time Attention, spatio-temporal attention is separated into a temporal attention module and a spatial attention module, which makes it easier to configure each module and to adjust how the two kinds of attention interact. We then design single-stream and two-stream models and, through a series of carefully designed experiments, study how information flows between spatial attention and temporal attention. The experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream structure remains worth considering: in some cases it outperforms the popular single-stream structure.
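The factorization described above can be illustrated concretely. The following is a minimal, single-head NumPy sketch (not the paper's implementation): each patch token first attends over the frames at its own spatial location (temporal attention), then the patches within each frame attend to one another (spatial attention). All function and variable names here are illustrative assumptions.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention over the second-to-last axis."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def divided_space_time_attention(x):
    """x: (T, N, D) tokens — T frames, N patches per frame, D channels.

    Divided attention replaces one joint pass over all T*N tokens with two
    cheaper passes: attention over time (length T), then over space (length N).
    """
    # Temporal attention: for each spatial location, attend across the T frames.
    xt = x.transpose(1, 0, 2)            # (N, T, D)
    xt = xt + attention(xt, xt, xt)      # residual connection, transformer-style
    x = xt.transpose(1, 0, 2)            # back to (T, N, D)
    # Spatial attention: within each frame, attend across the N patches.
    x = x + attention(x, x, x)
    return x

rng = np.random.default_rng(0)
video_tokens = rng.standard_normal((4, 16, 8))   # 4 frames, 16 patches, 8-dim
out = divided_space_time_attention(video_tokens)
print(out.shape)  # (4, 16, 8)
```

Because the two modules are separate functions, reweighting, reordering, or duplicating them into parallel streams (as in the single-stream vs. two-stream comparison above) only changes how the calls are composed.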
Data Availability
The data used in this paper are all from public datasets.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant No. 62172267), the Natural Science Foundation of Shanghai, China (Grant No. 20ZR1420400), the State Key Program of National Natural Science Foundation of China (Grant No. 61936001), the Shanghai Pujiang Program (Grant No. 21PJ1404200), the Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02).
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest with any person or institution regarding the publication of this paper.
About this article
Cite this article
Wu, X., Tao, C., Zhang, J. et al. Space or time for video classification transformers. Appl Intell 53, 23039–23048 (2023). https://doi.org/10.1007/s10489-023-04756-5