ABSTRACT
Video content conveys meaning through a combination of audio and visual cues, yet how to effectively fuse audio and video features into a robust content representation remains an open problem. In this paper, we propose a multi-modal fusion algorithm for video classification that combines audio and video, TVC-T2T (Transformer Video Classification based on video token-to-token). To improve training efficiency and enrich feature expression, we apply the video token-to-token operation to both the video frames and the audio converted into spectrogram images. In the spatial-temporal dimension, convolution is used to reduce the number of parameters, and a deep, narrow structure is adopted to enhance the model's recognition performance. To improve classification, the usual two-stream modeling is replaced with single-stream modeling over the spliced token sequence, concentrating the model's computation on modal interaction. Extensive experiments show that the proposed method effectively improves results on two commonly used benchmarks, Kinetics-400 and Kinetics-600.
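To make the single-stream fusion concrete, the sketch below shows one plausible reading of the design in PyTorch. It is a minimal illustration, not the authors' implementation: the class name `SingleStreamFusion`, the tokenizer kernel sizes, the embedding width, the depth, and the 400-class head are all assumptions chosen for clarity. A strided convolution stands in for the token-to-token reduction, cutting the token count before the transformer, and the video and audio token sequences are spliced into a single stream so that self-attention models the cross-modal interaction directly.

```python
import torch
import torch.nn as nn

class SingleStreamFusion(nn.Module):
    """Illustrative sketch (not the paper's code): video tokens and
    audio-spectrogram tokens are spliced into one sequence so that
    self-attention models the cross-modal interaction directly."""

    def __init__(self, dim=256, depth=8, heads=4, num_classes=400):
        super().__init__()
        # Strided convolutions stand in for the token-to-token reduction:
        # they shrink the token count before the transformer, which is
        # where most of the parameter savings come from in this sketch.
        self.video_tokenizer = nn.Conv3d(3, dim, kernel_size=(2, 16, 16),
                                         stride=(2, 16, 16))
        self.audio_tokenizer = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        # Learned modality embeddings mark which stream a token came from.
        self.modality_embed = nn.Parameter(torch.zeros(2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        # "Deep and narrow": many layers at a modest width.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # e.g. Kinetics-400

    def forward(self, video, spectrogram):
        # video: (B, 3, T, H, W); spectrogram: (B, 1, freq, time)
        v = self.video_tokenizer(video).flatten(2).transpose(1, 2)
        a = self.audio_tokenizer(spectrogram).flatten(2).transpose(1, 2)
        v = v + self.modality_embed[0]
        a = a + self.modality_embed[1]
        x = torch.cat([v, a], dim=1)      # spliced single-stream sequence
        x = self.encoder(x)               # attention spans both modalities
        return self.head(x.mean(dim=1))   # pool tokens, then classify

if __name__ == "__main__":
    model = SingleStreamFusion()
    video = torch.randn(2, 3, 8, 224, 224)   # 8 RGB frames
    spec = torch.randn(2, 1, 128, 128)       # log-mel spectrogram image
    print(model(video, spec).shape)          # torch.Size([2, 400])
```

A deeper, narrower configuration (larger `depth`, smaller `dim`) reflects the design preference stated in the abstract; the mean-pooled classification head used here is one common choice among several.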
Index Terms
- Transformer Video Classification algorithm based on video token-to-token