When users upload micro-videos to a platform, segmentation and splicing make the content of different shots in the same scene discontinuous, which leads to large content differences between shots. In addition, low-resolution shooting equipment, camera jitter, and other factors introduce noise into the video. Conventional, serialized scene feature learning for micro-videos cannot capture the content differences and correlations between different shots, which weakens the semantic representation of scene features. This paper therefore proposes a micro-video scene classification method based on the De-noising Multi-shots Association Self-attention (DeMsASa) model. In this method, a shot boundary detection algorithm first segments the micro-video; the semantic representation of the multi-shot video scene is then learned through de-noising, modeling the association between video frames within the same shot, and modeling the association between different shots. Experimental results show that the proposed method outperforms existing micro-video scene classification methods in classification performance.
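
As a rough illustration of the two-level association idea described above, the sketch below applies self-attention first among the frames of each shot and then among the resulting shot vectors. It is not the DeMsASa implementation: shot boundaries are assumed to be given already, the de-noising step is omitted, the attention has no learned projections, mean pooling is an assumed aggregation, and every function name and input value is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Parameter-free scaled dot-product self-attention (illustrative only).

    Each output vector is a weighted average of all input vectors, with
    weights given by a softmax over scaled dot-product similarities.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def scene_representation(shots):
    """Build one scene vector from shots (hypothetical input layout).

    shots: list of shots, each a list of frame feature vectors.
    """
    # Stage 1: associate frames within each shot, then pool to one vector per shot.
    shot_vectors = [mean_pool(self_attention(frames)) for frames in shots]
    # Stage 2: associate the shot vectors with each other, then pool to a scene vector.
    return mean_pool(self_attention(shot_vectors))


# Toy input: two shots with 2-dimensional frame features (made-up values).
shots = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
]
scene = scene_representation(shots)
print(len(scene))  # 2
```

In a trained model the attention layers would carry learned query, key, and value projections, and the pooled scene vector would feed a classifier; the sketch only shows how frame-level and shot-level association can be stacked.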

Data and code availability
The data that support the findings of this study and the code are available at https://github.com/guojiemla/GJ_Project_DeMsASa.
This work was supported by the Natural Science Foundation of Shandong Province (ZR2021QF119, ZR2022MF272), the Foundation of Key Laboratory of Computing Power Network and Information Security, Ministry of Education (2023ZD030), and the doctoral funds of Shandong Jianzhu University.
Author information
Rui Gong: Writing original draft, Programming. Yu Zhang: Writing, review and editing. Yanhui Zhang: Writing, review and editing. Yue Liu: Writing, review and editing. Jie Guo: Methodology, Supervision. Xiushan Nie: Methodology, Supervision.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and informed consent
Written informed consent for publication of this paper was obtained from all authors.
About this article
Cite this article
Gong, R., Zhang, Y., Zhang, Y. et al. Demsasa: micro-video scene classification based on denoising multi-shots association self-attention. Pattern Anal Applic 27, 155 (2024). https://doi.org/10.1007/s10044-024-01378-6
DOI: https://doi.org/10.1007/s10044-024-01378-6