Abstract
Existing video transformers divide video frames into a long sequence of tokens and perform self-attention over all of them. However, tokens corresponding to background regions contribute little to recognition, and processing all of them introduces substantial computational redundancy. Based on this observation, we design an efficient video transformer named TDKE, which adaptively discards unimportant tokens and retains only the important ones during inference. Specifically, the backbone of TDKE is a 12-layer video transformer divided into two parts. The first part, the Scanner module, consists of the first two transformer layers; the second part, the Delicacy module, consists of the remaining ten layers. The Scanner module quickly discards unimportant tokens in each frame and selects a keyframe: the importance of each token is measured with pre-computed attention maps, and the frame with the highest importance score is defined as the keyframe. The Delicacy module applies a weight enhancement technique to the keyframe and further discards redundant tokens based on the enhanced attention maps. We evaluate TDKE on two action recognition datasets, Kinetics-400 and SSv2. The results confirm the efficiency of TDKE; for example, it achieves a Top-1 accuracy of 77.8% on Kinetics-400 with only 306 GFLOPs.
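To make the token-discard idea concrete, the sketch below keeps the highest-scoring tokens of each frame and selects the frame with the largest aggregate score as the keyframe. It is a minimal PyTorch illustration of the general technique, not the authors' implementation; the tensor shapes, the 50% keep ratio, the use of attention-derived per-token scores, and the function name `discard_tokens_and_pick_keyframe` are assumptions made for this example.

```python
# Minimal sketch: drop unimportant tokens per frame using attention-based
# importance scores, and pick the frame with the highest total score as keyframe.
import torch


def discard_tokens_and_pick_keyframe(tokens, attn, keep_ratio=0.5):
    """tokens: (T, N, C) patch tokens per frame.
    attn: (T, N) per-token importance scores (e.g. CLS-to-patch attention
    averaged over heads); higher means more important."""
    T, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))

    # Keep the n_keep most important tokens in every frame.
    keep_idx = attn.topk(n_keep, dim=1).indices                       # (T, n_keep)
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))

    # The frame whose kept tokens carry the highest total importance is the keyframe.
    frame_score = torch.gather(attn, 1, keep_idx).sum(dim=1)          # (T,)
    keyframe = int(frame_score.argmax())
    return kept, keyframe


# Example: 8 frames, 196 patch tokens of dimension 768, random scores.
tokens = torch.randn(8, 196, 768)
attn = torch.rand(8, 196)
kept, key = discard_tokens_and_pick_keyframe(tokens, attn)
print(kept.shape, key)   # torch.Size([8, 98, 768]) and the keyframe index
```

In this reading, only the kept tokens would be forwarded to the later (Delicacy) layers, which is where the computational savings come from.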
Data availability
The data are available from the corresponding author.
Code availability
The code is available from the corresponding author.
Acknowledgements
The authors greatly appreciate the guidance and suggestions provided by the editor and reviewers. This work is supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).
Funding
This work is supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).
Author information
Authors and Affiliations
Contributions
Qian Zhang contributed to ideas, design of methodology, verification, and writing review. Zuosui Yang contributed to ideas, coding, data collation, and writing. Mingwen Shao contributed to writing review and editing, and funding acquisition. Hong Liang contributed to ideas, experimental supervision, and writing review.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Materials availability
The materials are available from the corresponding author.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, Q., Yang, Z., Shao, M. et al. An efficient video transformer network with token discard and keyframe enhancement for action recognition. J Supercomput 81, 408 (2025). https://doi.org/10.1007/s11227-025-06927-2