
An efficient video transformer network with token discard and keyframe enhancement for action recognition

The Journal of Supercomputing

Abstract

Existing video transformers divide video frames into a long sequence of tokens and compute self-attention among all of them. However, tokens corresponding to background information contribute little to recognition, and processing them in full introduces substantial computational redundancy. Based on this observation, we design an efficient video transformer named TDKE, which adaptively discards unimportant tokens and uses only the important ones during inference. Specifically, the backbone of TDKE is a 12-layer video transformer divided into two main parts: the Scanner module, composed of the first two transformer layers, and the Delicacy module, composed of the remaining ten layers. The Scanner module quickly discards unimportant tokens of each frame and selects a keyframe: the importance of each token is measured by pre-computed attention maps, and the frame with the highest importance score is defined as the keyframe. The Delicacy module applies a weight enhancement technique to the keyframe and further discards redundant tokens based on our enhanced attention maps. We evaluate TDKE on two action recognition datasets, Kinetics-400 and SSv2, and the results confirm its efficiency. For example, TDKE achieves a Top-1 accuracy of 77.8% on Kinetics-400 using only 306 GFLOPs.
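The abstract describes two operations: discarding patch tokens whose attention-based importance is low, and scoring each frame so the highest-scoring one becomes the keyframe. The sketch below illustrates that idea only; it is not the authors' implementation. All names (score_tokens, discard_tokens, keep_ratio) and the scoring rule (class-token attention averaged over heads) are assumptions made for illustration.

```python
# Illustrative sketch of attention-guided token discard and keyframe selection.
# Assumption: each frame is a token sequence with a class token at index 0, and
# importance is taken as the class token's attention to each patch token.
import torch


def score_tokens(attn: torch.Tensor) -> torch.Tensor:
    """attn: (B, H, N, N) attention maps, class token at index 0.
    Returns per-patch-token importance (B, N-1), averaged over heads."""
    return attn[:, :, 0, 1:].mean(dim=1)


def discard_tokens(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.5):
    """tokens: (B, N, C) with the class token first.
    Keeps the top-k patch tokens per frame and returns a per-frame score."""
    scores = score_tokens(attn)                      # (B, N-1)
    k = max(1, int(scores.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices + 1          # +1 skips the class token
    idx, _ = idx.sort(dim=1)                         # preserve spatial order
    cls_tok = tokens[:, :1]
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    frame_score = scores.sum(dim=1)                  # frame importance = summed token importance
    return torch.cat([cls_tok, kept], dim=1), frame_score


# Toy usage: 8 frames of one clip, 197 tokens (1 class + 196 patches), 768 channels.
frames = torch.randn(8, 197, 768)
attn = torch.softmax(torch.randn(8, 12, 197, 197), dim=-1)
pruned, frame_scores = discard_tokens(frames, attn, keep_ratio=0.5)
keyframe = int(frame_scores.argmax())                # frame with the highest score
print(pruned.shape, keyframe)
```

In this reading, the kept tokens would be passed to the later layers (the paper's Delicacy module), with the keyframe's tokens given extra weight; how that enhancement is applied is specific to the paper and is not reproduced here.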


Data availability

The data are available from the corresponding author.

Code availability

The code is available from the corresponding author.


Acknowledgements

The authors greatly appreciate the guidance and suggestions provided by the editor and reviewers for this article. This work is supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).

Funding

This work is supported by the National Natural Science Foundation of China (No. 61673396) and the Natural Science Foundation of Shandong Province (No. ZR2022MF260).

Author information

Authors and Affiliations

Authors

Contributions

Qian Zhang contributed to ideas, design of methodology, verification, and writing (review). Zuosui Yang contributed to ideas, coding, data collation, and writing. Mingwen Shao contributed to writing (review and editing) and funding acquisition. Hong Liang contributed to ideas, experimental supervision, and writing (review).

Corresponding author

Correspondence to Zuosui Yang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Materials availability

The materials are available from the corresponding author.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Zhang, Q., Yang, Z., Shao, M. et al. An efficient video transformer network with token discard and keyframe enhancement for action recognition. J Supercomput 81, 408 (2025). https://doi.org/10.1007/s11227-025-06927-2


  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11227-025-06927-2

Keywords