
Demsasa: micro-video scene classification based on denoising multi-shots association self-attention

  • Original Paper
  • Published in: Pattern Analysis and Applications

Abstract

Because micro-videos are often segmented and spliced when users upload them to a platform, the content of different shots within the same scene is discontinuous, which leads to large content differences between shots. At the same time, low-resolution capture devices, camera jitter, and other factors introduce noise into the video. Given these problems, conventional serialized scene-feature learning for micro-videos cannot capture the content differences and correlations between shots, which weakens the semantic representation of scene features. This paper therefore proposes a micro-video scene classification method based on a De-noising Multi-shots Association Self-attention (DeMsASa) model. In this method, a shot boundary detection algorithm first segments the micro-video; the semantic representation of the multi-shot video scene is then learned through de-noising, association modeling between video frames within the same shot, and association modeling between different shots. Experimental results show that the classification performance of the proposed method is superior to existing micro-video scene classification methods.
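The following is a minimal, hypothetical sketch (PyTorch) of the pipeline described above; the module and parameter names are illustrative assumptions, not the authors' implementation, which is available in the repository linked under "Data and code availability". Shot boundary detection is assumed to have already split the video into shots whose frames are encoded as feature vectors.

import torch
import torch.nn as nn

class DeMsASaSketch(nn.Module):
    """Illustrative stages: de-noising, intra-shot association, inter-shot association, classification."""
    def __init__(self, feat_dim=512, num_heads=8, num_classes=50):
        super().__init__()
        # Self-attention block standing in for the de-noising of frame features.
        self.denoise = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        # Models association between frames within the same shot.
        self.intra_shot = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        # Models association between the shot-level representations of one scene.
        self.inter_shot = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, shots):
        # shots: list of tensors, each of shape (num_frames_i, feat_dim)
        shot_reprs = []
        for frames in shots:
            x = frames.unsqueeze(0)           # (1, T_i, D)
            x = self.denoise(x)               # suppress noisy frame content
            x = self.intra_shot(x)            # associate frames within the same shot
            shot_reprs.append(x.mean(dim=1))  # pool to one shot-level vector, (1, D)
        s = torch.cat(shot_reprs, dim=0).unsqueeze(0)  # (1, num_shots, D)
        s = self.inter_shot(s)                # associate the different shots of the scene
        return self.classifier(s.mean(dim=1))  # scene-category logits, shape (1, num_classes)

# Example: a micro-video segmented into three shots of 12, 8 and 20 frames.
model = DeMsASaSketch(num_classes=50)        # num_classes is dataset-dependent (assumed here)
shots = [torch.randn(t, 512) for t in (12, 8, 20)]
logits = model(shots)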


Data and code availability

The data that support the findings of this study and the code are available at https://github.com/guojiemla/GJ_Project_DeMsASa.


Acknowledgements

This work was supported by the Natural Science Foundation of Shandong Province (ZR2021QF119, ZR2022MF272), the Foundation of Key Laboratory of Computing Power Network and Information Security, Ministry of Education (2023ZD030), and the doctoral funds of Shandong Jianzhu University.

Author information


Contributions

Rui Gong: Writing original draft, Programming. Yu Zhang: Writing, review and editing. Yanhui Zhang: Writing, review and editing. Yue Liu: Writing, review and editing. Jie Guo: Methodology, Supervision. Xiushan Nie: Methodology, Supervision.

Corresponding authors

Correspondence to Jie Guo or Xiushan Nie.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent

Written informed consent for publication of this paper was obtained from all authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gong, R., Zhang, Y., Zhang, Y. et al. Demsasa: micro-video scene classification based on denoising multi-shots association self-attention. Pattern Anal Applic 27, 155 (2024). https://doi.org/10.1007/s10044-024-01378-6
