
Multi-semantic Representation with Transformer Network for Video Classification

Published: 07 September 2023 Publication History

Abstract

Video classification is an important and challenging task. Videos usually contain a series of key actions and motion patterns, which a video classifier must learn and describe with an embedding vector. These actions and patterns generally carry different levels of semantic information. However, existing methods usually consider only a single level of semantic features, such as the output of the last pooling layer, to represent the entire video. As a result, complex video content cannot be represented effectively, and classification accuracy suffers. To address this limitation, we propose a novel multi-semantic representation method for video classification. Our method consists of several transformer network blocks, semantic graph attention modules, and a feature fusion module. Each transformer block extracts visual features of the video frames, and the features of the last block are transformed into an embedding vector; the blocks thus capture different levels of visual features. The graph attention modules use these features to generate multi-semantic vectors for a video. Finally, the multi-semantic vectors and the embedding vector are combined by the feature fusion module, and the fused vector is used to classify the video. Extensive experiments on a benchmark video classification dataset demonstrate that our method outperforms various state-of-the-art methods.
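The abstract describes a three-stage pipeline: multi-level features from stacked transformer blocks, attention-based pooling of each level into a semantic vector, and fusion of those vectors with a global embedding. The NumPy sketch below illustrates that general idea only; the attention-pooling rule, dimensions, and all names are hypothetical assumptions, not the authors' implementation (the paper's graph attention modules are more elaborate).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_attention(frame_feats):
    """Attention-pool T frame features (T, D) into one semantic vector (D,).

    Stand-in for a graph attention module: frames are scored by
    similarity to the mean frame, then combined with softmax weights.
    """
    scores = frame_feats @ frame_feats.mean(axis=0)   # (T,)
    weights = softmax(scores)                          # sums to 1
    return weights @ frame_feats                       # (D,)

rng = np.random.default_rng(0)
T, D = 8, 16  # hypothetical: 8 frames, 16-dim features

# stand-ins for the per-frame outputs of three transformer blocks
# (low-, mid-, and high-level visual features)
block_feats = [rng.normal(size=(T, D)) for _ in range(3)]

# one semantic vector per feature level
semantic_vecs = [semantic_attention(f) for f in block_feats]

# global embedding vector derived from the last block
global_vec = block_feats[-1].mean(axis=0)

# feature fusion by simple concatenation (a stand-in for the fusion module);
# a classifier head would consume this fused vector
fused = np.concatenate(semantic_vecs + [global_vec])
print(fused.shape)  # (64,)
```

The fused vector here is 4 × D = 64 dimensional; in practice the fusion module would learn how to weight the levels rather than concatenate them blindly.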


          Published In

ICMLC '23: Proceedings of the 2023 15th International Conference on Machine Learning and Computing
February 2023, 619 pages
ISBN: 9781450398411
DOI: 10.1145/3587716

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. Video classification
2. Action recognition
3. Video representation learning

Qualifiers

• Research article
• Refereed limited

          Funding Sources

          • Shenzhen Science and Technology Program

          Conference

          ICMLC 2023
