
Demsasa: micro-video scene classification based on denoising multi-shots association self-attention

  • Original Paper
  • Published in: Pattern Analysis and Applications

Abstract

Because micro-videos are often segmented and spliced when users upload them to a platform, the content of different shots within the same scene is discontinuous, which leads to large content differences between shots. At the same time, low-resolution capture devices, camera jitter, and other factors introduce noise into the video. Given these problems, conventional serialized scene-feature learning for micro-videos cannot capture the content differences and correlations between shots, which weakens the semantic representation of scene features. This paper therefore proposes a micro-video scene classification method based on a De-noising Multi-shots Association Self-attention (DeMsASa) model. In this method, a shot boundary detection algorithm first segments the micro-video; the semantic representation of the multi-shot video scene is then learned through de-noising, association modeling between video frames within the same shot, and association modeling between different shots. Experimental results show that the classification performance of the proposed method is superior to existing micro-video scene classification methods.
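The following is a minimal, hypothetical sketch (PyTorch) of the pipeline described above; the module and parameter names are illustrative assumptions, not the authors' implementation, which is available in the repository linked under "Data and code availability". Shot boundary detection is assumed to have already split the video into shots whose frames are encoded as feature vectors.

import torch
import torch.nn as nn

class DeMsASaSketch(nn.Module):
    """Illustrative stages: de-noising, intra-shot association, inter-shot association, classification."""
    def __init__(self, feat_dim=512, num_heads=8, num_classes=50):
        super().__init__()
        # Self-attention block standing in for the de-noising of frame features.
        self.denoise = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        # Models association between frames within the same shot.
        self.intra_shot = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        # Models association between the shot-level representations of one scene.
        self.inter_shot = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, shots):
        # shots: list of tensors, each of shape (num_frames_i, feat_dim)
        shot_reprs = []
        for frames in shots:
            x = frames.unsqueeze(0)           # (1, T_i, D)
            x = self.denoise(x)               # suppress noisy frame content
            x = self.intra_shot(x)            # associate frames within the same shot
            shot_reprs.append(x.mean(dim=1))  # pool to one shot-level vector, (1, D)
        s = torch.cat(shot_reprs, dim=0).unsqueeze(0)  # (1, num_shots, D)
        s = self.inter_shot(s)                # associate the different shots of the scene
        return self.classifier(s.mean(dim=1))  # scene-category logits, shape (1, num_classes)

# Example: a micro-video segmented into three shots of 12, 8 and 20 frames.
model = DeMsASaSketch(num_classes=50)        # num_classes is dataset-dependent (assumed here)
shots = [torch.randn(t, 512) for t in (12, 8, 20)]
logits = model(shots)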


Data and code availability

The data that support the findings of this study and the code are available at https://github.com/guojiemla/GJ_Project_DeMsASa.


Acknowledgements

This work was supported by the Natural Science Foundation of Shandong Province (ZR2021QF119, ZR2022MF272), the Foundation of Key Laboratory of Computing Power Network and Information Security, Ministry of Education (2023ZD030), and the doctoral funds of Shandong Jianzhu University.

Author information


Contributions

Rui Gong: Writing original draft, Programming. Yu Zhang: Writing, review and editing. Yanhui Zhang: Writing, review and editing. Yue Liu: Writing, review and editing. Jie Guo: Methodology, Supervision. Xiushan Nie: Methodology, Supervision.

Corresponding authors

Correspondence to Jie Guo or Xiushan Nie.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent

Written informed consent for publication of this paper was obtained from all authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gong, R., Zhang, Y., Zhang, Y. et al. Demsasa: micro-video scene classification based on denoising multi-shots association self-attention. Pattern Anal Applic 27, 155 (2024). https://doi.org/10.1007/s10044-024-01378-6
