ABSTRACT
Video content conveys meaning through a combination of audio and visual cues, yet how to effectively fuse audio and video features into a robust content representation remains an open problem. In this paper, we propose a multi-modal fusion algorithm for video classification that combines audio and video, TVC-T2T (Transformer Video Classification based on video token-to-token). To improve training efficiency and enrich feature expression, we apply the video token-to-token operation to both the video frames and the audio converted into spectrogram images. In the spatial-temporal dimension, convolution is used to reduce the number of parameters, and a deep, narrow structure is adopted to enhance the model's recognition performance. To improve classification, the usual two-stream modeling is replaced with single-stream modeling over the spliced token sequence, concentrating the model's computation on modal interaction. Extensive experiments show that the proposed method effectively improves results on two commonly used benchmarks, Kinetics-400 and Kinetics-600.
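To make the single-stream fusion concrete, the sketch below shows one plausible reading of the design in PyTorch. It is a minimal illustration, not the authors' implementation: the class name `SingleStreamFusion`, the tokenizer kernel sizes, the embedding width, the depth, and the 400-class head are all assumptions chosen for clarity. A strided convolution stands in for the token-to-token reduction, cutting the token count before the transformer, and the video and audio token sequences are spliced into a single stream so that self-attention models the cross-modal interaction directly.

```python
import torch
import torch.nn as nn

class SingleStreamFusion(nn.Module):
    """Illustrative sketch (not the paper's code): video tokens and
    audio-spectrogram tokens are spliced into one sequence so that
    self-attention models the cross-modal interaction directly."""

    def __init__(self, dim=256, depth=8, heads=4, num_classes=400):
        super().__init__()
        # Strided convolutions stand in for the token-to-token reduction:
        # they shrink the token count before the transformer, which is
        # where most of the parameter savings come from in this sketch.
        self.video_tokenizer = nn.Conv3d(3, dim, kernel_size=(2, 16, 16),
                                         stride=(2, 16, 16))
        self.audio_tokenizer = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        # Learned modality embeddings mark which stream a token came from.
        self.modality_embed = nn.Parameter(torch.zeros(2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        # "Deep and narrow": many layers at a modest width.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # e.g. Kinetics-400

    def forward(self, video, spectrogram):
        # video: (B, 3, T, H, W); spectrogram: (B, 1, freq, time)
        v = self.video_tokenizer(video).flatten(2).transpose(1, 2)
        a = self.audio_tokenizer(spectrogram).flatten(2).transpose(1, 2)
        v = v + self.modality_embed[0]
        a = a + self.modality_embed[1]
        x = torch.cat([v, a], dim=1)      # spliced single-stream sequence
        x = self.encoder(x)               # attention spans both modalities
        return self.head(x.mean(dim=1))   # pool tokens, then classify

if __name__ == "__main__":
    model = SingleStreamFusion()
    video = torch.randn(2, 3, 8, 224, 224)   # 8 RGB frames
    spec = torch.randn(2, 1, 128, 128)       # log-mel spectrogram image
    print(model(video, spec).shape)          # torch.Size([2, 400])
```

A deeper, narrower configuration (larger `depth`, smaller `dim`) reflects the design preference stated in the abstract; the mean-pooled classification head used here is one common choice among several.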
Index Terms
- Transformer Video Classification algorithm based on video token-to-token