
Transformer Video Classification algorithm based on video token-to-token

Published: 16 October 2022


Video content is expressed through a combination of audio and visual signals. How to combine audio and video features effectively and generate a robust content representation remains an open problem. In this paper, we propose a multi-modal fusion video classification algorithm that combines audio and video (TVC-T2T: Transformer Video Classification algorithm based on video token-to-token). To improve training efficiency and enrich the feature representation, we apply a video token-to-token operation to the video frames and to the spectrum images converted from the audio. In the spatial-temporal dimension, convolution is used to reduce the number of parameters, and a deep, narrow architecture is used to improve the recognition performance of the model. To improve classification accuracy, the two-stream model is replaced with a concatenated single-stream model, so that the model's computation is focused on interaction between the modalities. Extensive experiments show that the proposed method improves results on two commonly used benchmarks (Kinetics 400 and Kinetics 600).
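
The abstract does not give implementation details, so the following is only a rough sketch of the two ideas it names: a token-to-token step and single-stream concatenation of the two modalities. It assumes the overlapping "soft split" of the image T2T-ViT (kernel 3, stride 2, padding 1) applied independently to each frame and to the audio spectrum image; the paper's video variant may differ. The names `soft_split`, `single_stream` and `make_grid`, the grid sizes, and the choice of equal token dimensions for both modalities are all hypothetical.

```python
from typing import List

Token = List[float]
Grid = List[List[Token]]  # grid[row][col] is one token vector


def soft_split(grid: Grid, k: int = 3, stride: int = 2, pad: int = 1) -> List[Token]:
    """Overlapping soft split in the style of T2T-ViT (an assumption here):
    each k x k window of neighbouring tokens is concatenated into one
    longer token, so the sequence gets shorter and each token gets richer."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    zero = [0.0] * c  # zero padding outside the grid
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    tokens: List[Token] = []
    for oy in range(out_h):
        for ox in range(out_w):
            merged: Token = []
            for dy in range(k):
                for dx in range(k):
                    y = oy * stride + dy - pad
                    x = ox * stride + dx - pad
                    inside = 0 <= y < h and 0 <= x < w
                    merged.extend(grid[y][x] if inside else zero)
            tokens.append(merged)
    return tokens


def single_stream(video_frames: List[Grid], spectrum_image: Grid) -> List[Token]:
    """Concatenate video and audio tokens into one sequence, so a shared
    encoder (not shown) can attend across both modalities."""
    sequence: List[Token] = []
    for frame in video_frames:
        sequence.extend(soft_split(frame))
    sequence.extend(soft_split(spectrum_image))
    return sequence


def make_grid(h: int, w: int, c: int, value: float) -> Grid:
    """Hypothetical helper: a constant h x w grid of c-dimensional tokens."""
    return [[[value] * c for _ in range(w)] for _ in range(h)]


# Toy input: two 8x8 video frames and one 8x8 audio spectrum image.
frames = [make_grid(8, 8, 4, 1.0) for _ in range(2)]
audio = make_grid(8, 8, 4, 2.0)
sequence = single_stream(frames, audio)
print(len(sequence), len(sequence[0]))
```

With these toy sizes, each 8x8 grid of 4-dimensional tokens is reduced to a 4x4 grid of 36-dimensional tokens, and the three grids are joined into one sequence of 48 tokens. A real model would follow this with a learned projection and a transformer encoder, both omitted here.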






    Published In

    ICCCM '22: Proceedings of the 10th International Conference on Computer and Communications Management
    July 2022
    289 pages


    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. computer vision
    2. convolution
    3. feature fusion
    4. transformer
    5. video classification


    • Research-article
    • Research
    • Refereed limited


    ICCCM 2022

