Abstract
Existing video transformers divide video frames into a long sequence of tokens and perform self-attention over all of them. However, tokens corresponding to background regions contribute little to recognition, and processing all of them introduces substantial computational redundancy. Based on this observation, we design an efficient video transformer named TDKE, which adaptively discards unimportant tokens and retains only the important ones during inference. Specifically, the backbone of TDKE is a 12-layer video transformer divided into two parts. The first part, the Scanner module, consists of the first two transformer layers; the second part, the Delicacy module, consists of the remaining ten layers. The Scanner module quickly discards unimportant tokens in each frame and selects a keyframe: the importance of each token is measured with pre-computed attention maps, and the frame with the highest importance score is defined as the keyframe. The Delicacy module applies a weight enhancement technique to the keyframe and further discards redundant tokens based on the enhanced attention maps. We evaluate TDKE on two action recognition datasets, Kinetics-400 and SSv2. The results confirm the efficiency of TDKE; for example, it achieves a Top-1 accuracy of 77.8% on Kinetics-400 with only 306 GFLOPs.
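To make the token-discard idea concrete, the sketch below keeps the highest-scoring tokens of each frame and selects the frame with the largest aggregate score as the keyframe. It is a minimal PyTorch illustration of the general technique, not the authors' implementation; the tensor shapes, the 50% keep ratio, the use of attention-derived per-token scores, and the function name `discard_tokens_and_pick_keyframe` are assumptions made for this example.

```python
# Minimal sketch: drop unimportant tokens per frame using attention-based
# importance scores, and pick the frame with the highest total score as keyframe.
import torch


def discard_tokens_and_pick_keyframe(tokens, attn, keep_ratio=0.5):
    """tokens: (T, N, C) patch tokens per frame.
    attn: (T, N) per-token importance scores (e.g. CLS-to-patch attention
    averaged over heads); higher means more important."""
    T, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Keep the n_keep most important tokens in every frame.
    keep_idx = attn.topk(n_keep, dim=1).indices                       # (T, n_keep)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

    # The frame whose kept tokens carry the highest total importance is the keyframe.
    frame_score = torch.gather(attn, 1, keep_idx).sum(dim=1)          # (T,)
    keyframe = int(frame_score.argmax())
    return kept, keyframe


# Example: 8 frames, 196 patch tokens of dimension 768, random scores.
tokens = torch.randn(8, 196, 768)
attn = torch.rand(8, 196)
kept, key = discard_tokens_and_pick_keyframe(tokens, attn)
print(kept.shape, key)   # torch.Size([8, 98, 768]) and the keyframe index
```

In this reading, only the kept tokens would be forwarded to the later (Delicacy) layers, which is where the computational savings come from.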
Data availability
The data are available from the corresponding author.
Code availability
The code is available from the corresponding author.
Acknowledgements
The authors greatly appreciate the guidance and suggestions provided by the editor and reviewers. This work is supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).
Funding
This work is supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).
Author information
Authors and Affiliations
Contributions
Qian Zhang contributed to ideas, design of methodology, verification, and writing review. Zuosui Yang contributed to ideas, coding, data collation, and writing. Mingwen Shao contributed to writing review and editing, and funding acquisition. Hong Liang contributed to ideas, experimental supervision, and writing review.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Materials availability
The materials are available from the corresponding author.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Q., Yang, Z., Shao, M. et al. An efficient video transformer network with token discard and keyframe enhancement for action recognition. J Supercomput 81, 408 (2025). https://doi.org/10.1007/s11227-025-06927-2