DOI: 10.1145/3556223.3556241

Transformer Video Classification algorithm based on video token-to-token

Published: 16 October 2022

ABSTRACT

The expression of video content combines audio and visual aspects, and how to effectively fuse audio and video features into a robust content representation remains an open problem. In this paper, we propose a multi-modal fusion video classification algorithm that combines audio and video (TVC-T2T: Transformer Video Classification algorithm based on video token-to-token). To improve training efficiency and enrich feature expression, we apply a video token-to-token operation to both the video frames and the audio spectrograms converted into images. In the spatial-temporal dimension, convolution is used to reduce the number of parameters, and a deep, narrow structure is adopted to strengthen the recognition performance of the model. To improve classification, the two-stream modelling is replaced by spliced single-stream modelling, so that the model's computation is concentrated on modal interaction. Extensive experiments show that the proposed method effectively improves results on two commonly used benchmarks (Kinetics-400 and Kinetics-600).
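The abstract is only a high-level summary, so the sketch below is a hypothetical PyTorch reconstruction of its two main ideas, not the authors' TVC-T2T implementation: an overlapping token-to-token re-structuring step applied to both video-frame tokens and audio-spectrogram tokens, followed by spliced single-stream modelling in which the two token sequences are concatenated and passed through one shared Transformer encoder. The module names, layer sizes, the single-frame input, and the omission of positional embeddings and temporal modelling are all simplifying assumptions.

```python
import torch
import torch.nn as nn


class SoftSplit(nn.Module):
    """One token-to-token step (in the spirit of T2T-ViT): regroup neighbouring
    tokens with an overlapping unfold, then re-embed them, shrinking the sequence."""

    def __init__(self, in_dim, out_dim, kernel=3, stride=2):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size=kernel, stride=stride, padding=1)
        self.proj = nn.Linear(in_dim * kernel * kernel, out_dim)

    def forward(self, x, h, w):
        b, n, c = x.shape                       # (batch, tokens, dim), n == h * w
        x = x.transpose(1, 2).reshape(b, c, h, w)
        x = self.unfold(x).transpose(1, 2)      # (batch, fewer tokens, c * k * k)
        return self.proj(x)


class SingleStreamAVTransformer(nn.Module):
    """Splice visual-frame tokens and audio-spectrogram tokens into one sequence
    so that self-attention models the cross-modal interaction directly."""

    def __init__(self, dim=256, depth=4, heads=4, num_classes=400):
        super().__init__()
        # Convolutional patch embeddings for RGB frames and spectrogram images.
        self.video_patch = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3)
        self.audio_patch = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3)
        self.t2t_video = SoftSplit(64, dim)
        self.t2t_audio = SoftSplit(64, dim)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def tokenize(self, img, patch, t2t):
        x = patch(img)                          # (b, 64, H/4, W/4)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)        # (b, h * w, 64)
        return t2t(x, h, w)                     # token-to-token re-structuring

    def forward(self, frame, spectrogram):
        v = self.tokenize(frame, self.video_patch, self.t2t_video)
        a = self.tokenize(spectrogram, self.audio_patch, self.t2t_audio)
        # Spliced single-stream modelling: one joint sequence for both modalities.
        tokens = torch.cat([self.cls.expand(v.size(0), -1, -1), v, a], dim=1)
        return self.head(self.encoder(tokens)[:, 0])   # classify from the CLS token


if __name__ == "__main__":
    model = SingleStreamAVTransformer()
    frame = torch.randn(2, 3, 224, 224)          # a sampled RGB frame per clip
    spec = torch.randn(2, 3, 224, 224)           # log-mel spectrogram rendered as an image
    print(model(frame, spec).shape)              # torch.Size([2, 400])
```

Concatenating the modalities before the encoder means every attention layer spends its computation on cross-modal interaction, which is the motivation the abstract gives for replacing the two-stream design with a spliced single stream.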


Published in:
ICCCM '22: Proceedings of the 10th International Conference on Computer and Communications Management
July 2022, 289 pages
ISBN: 9781450396349
DOI: 10.1145/3556223
Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States

