
Multi-semantic Representation with Transformer Network for Video Classification

Published: 07 September 2023 Publication History

Abstract

Video classification is an important and challenging task. Videos usually contain a series of key actions and motion patterns, which a video classifier must learn and describe with an embedding vector. These actions and patterns generally carry different levels of semantic information. However, existing methods usually consider only a single level of semantic features, such as the output of the last pooling layer, to represent the entire video. As a result, complex video content cannot be represented effectively, and classification accuracy suffers. To address this limitation, we propose a novel multi-semantic representation method for video classification. Our method consists of several transformer network blocks, semantic graph attention modules, and a feature fusion module. Each transformer block extracts visual features of the video frames, and the features of the last block are transformed into an embedding vector; the blocks thus capture different levels of visual features. The graph attention modules use these features to generate multi-semantic vectors for a video. Finally, the multi-semantic vectors and the embedding vector are combined by the feature fusion module, and the fused vector is used to classify the video. Extensive experiments on a benchmark video classification dataset demonstrate that our method outperforms various state-of-the-art methods.
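The abstract describes a three-stage pipeline: multi-level features from stacked transformer blocks, attention-based pooling of each level into a semantic vector, and fusion of those vectors with a global embedding. The NumPy sketch below illustrates that general idea only; the attention-pooling rule, dimensions, and all names are hypothetical assumptions, not the authors' implementation (the paper's graph attention modules are more elaborate).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_attention(frame_feats):
    """Attention-pool T frame features (T, D) into one semantic vector (D,).

    Stand-in for a graph attention module: frames are scored by
    similarity to the mean frame, then combined with softmax weights.
    """
    scores = frame_feats @ frame_feats.mean(axis=0)   # (T,)
    weights = softmax(scores)                          # sums to 1
    return weights @ frame_feats                       # (D,)

rng = np.random.default_rng(0)
T, D = 8, 16  # hypothetical: 8 frames, 16-dim features

# stand-ins for the per-frame outputs of three transformer blocks
# (low-, mid-, and high-level visual features)
block_feats = [rng.normal(size=(T, D)) for _ in range(3)]

# one semantic vector per feature level
semantic_vecs = [semantic_attention(f) for f in block_feats]

# global embedding vector derived from the last block
global_vec = block_feats[-1].mean(axis=0)

# feature fusion by simple concatenation (a stand-in for the fusion module);
# a classifier head would consume this fused vector
fused = np.concatenate(semantic_vecs + [global_vec])
print(fused.shape)  # (64,)
```

The fused vector here is 4 × D = 64 dimensional; in practice the fusion module would learn how to weight the levels rather than concatenate them blindly.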


          Published In

ICMLC '23: Proceedings of the 2023 15th International Conference on Machine Learning and Computing
February 2023, 619 pages
ISBN: 9781450398411
DOI: 10.1145/3587716

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Video classification
2. Action recognition
3. Video representation learning

Qualifiers

• Research article
• Refereed limited

          Funding Sources

          • Shenzhen Science and Technology Program

          Conference

          ICMLC 2023
