Space or time for video classification transformers

Abstract

Spatial and temporal attention play an important role in video classification tasks, yet few studies have examined the mechanism of spatial and temporal attention behind classification problems. Thanks to its self-attention mechanism, the Transformer offers excellent training scalability and captures long-range dependencies in sequences, which has led to great success in many fields, especially video classification. Through Divided Space-Time Attention, spatio-temporal attention is separated into a temporal attention module and a spatial attention module, which makes it more convenient to configure the attention modules and adjust how they interact. Single-stream and two-stream models are then designed to study how information is exchanged between spatial attention and temporal attention, using a series of carefully designed experiments. The experiments show that spatial attention is more critical than temporal attention, so the commonly used balanced strategy is not always the best choice. Furthermore, the classical two-stream structure deserves consideration in some cases, as it can achieve better results than the popular single-stream structure.
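
To make the Divided Space-Time Attention described above concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes a TimeSformer-style token layout in which each clip is tokenised into T frames of S patch embeddings, applies temporal attention across frames at each spatial position and then spatial attention within each frame, and omits the classification token, MLP sub-layers, and the single-stream/two-stream interaction schemes studied in the paper. The class name DividedSpaceTimeBlock and all shapes are illustrative.

```python
import torch
import torch.nn as nn


class DividedSpaceTimeBlock(nn.Module):
    """One transformer block with temporal and spatial attention applied separately."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds patch tokens of shape (batch B, frames T, patches S, dim D).
        b, t, s, d = x.shape

        # Temporal attention: tokens at the same spatial position attend across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        nt = self.norm_t(xt)
        xt = xt + self.temporal_attn(nt, nt, nt, need_weights=False)[0]
        x = xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

        # Spatial attention: tokens within the same frame attend to each other.
        xs = x.reshape(b * t, s, d)
        ns = self.norm_s(xs)
        xs = xs + self.spatial_attn(ns, ns, ns, need_weights=False)[0]
        return xs.reshape(b, t, s, d)


# Toy usage: 2 clips, 8 frames, 14x14 = 196 patch tokens, 768-dim embeddings.
tokens = torch.randn(2, 8, 196, 768)
block = DividedSpaceTimeBlock(dim=768, heads=12)
out = block(tokens)  # shape preserved: (2, 8, 196, 768)
```

Swapping the order of the two attention calls, weighting one module over the other, or routing them into parallel streams is exactly the kind of configuration choice that the paper's experiments compare.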

Data Availability

The data used in this paper are all from public datasets.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 62172267), the Natural Science Foundation of Shanghai, China (Grant No. 20ZR1420400), the State Key Program of the National Natural Science Foundation of China (Grant No. 61936001), the Shanghai Pujiang Program (Grant No. 21PJ1404200), and the Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02).

Author information

Corresponding author

Correspondence to Xing Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest with any person or institution regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, X., Tao, C., Zhang, J. et al. Space or time for video classification transformers. Appl Intell 53, 23039–23048 (2023). https://doi.org/10.1007/s10489-023-04756-5
