
A deep learning-based classification for topic detection of audiovisual documents


Abstract

The description of audiovisual documents aims essentially at providing meaningful and explanatory information about their content. Despite the many efforts made by researchers to extract descriptions, the lack of pertinent semantic descriptions persists. In this paper, we introduce a new approach to improve the semantic descriptions of cinematic audiovisual documents. To ensure a high description level, we combine different sources of information related to the content (the script of the movie and the text superimposed on the image). This process is mainly based on a semantic segmentation algorithm. The Structural Topic Model (STM) and the LSCOM ontology (Large-Scale Concept Ontology for Multimedia, http://www.ee.columbia.edu/ln/dvmm/lscom/) are adapted for knowledge and description extraction. Deep classification techniques, namely LSTM (long short-term memory) networks and softmax regression, are used to classify the generic topics into specific topics. The performance of the developed approach is assessed as follows. First, the STM topic model is adapted and evaluated using the CMU movie summary corpus. Then, the topic detection and classification processes are applied, and their results are compared to those provided by human judgments on the MovieLens dataset. Finally, a quantitative evaluation is performed using the M-VAD (Montreal Video Annotation Dataset) [44] and MPII-MD (MPII Movie Description) [35] datasets. The comparative study shows that the suggested approach outperforms existing ones in terms of the precision of the obtained topics.
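
To make the classification stage concrete, the sketch below shows one way generic topic descriptors could be mapped to specific topics with an LSTM encoder followed by a softmax regression layer, as named in the abstract. It is a minimal PyTorch sketch under our own assumptions, not the authors' implementation: the TopicClassifier name, the vocabulary size, the embedding and hidden dimensions, and the number of target topics are all illustrative.

    # Minimal sketch (illustrative assumptions, not the paper's implementation):
    # an LSTM encodes a tokenized generic-topic description and a softmax
    # regression layer assigns it to one of a fixed set of specific topics.
    import torch
    import torch.nn as nn

    class TopicClassifier(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=128,
                     hidden_dim=256, num_topics=20):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.fc = nn.Linear(hidden_dim, num_topics)  # softmax regression layer

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) integer-encoded words of a topic description
            embedded = self.embedding(token_ids)
            _, (h_n, _) = self.lstm(embedded)            # final hidden state of the LSTM
            logits = self.fc(h_n[-1])                    # (batch, num_topics)
            return torch.softmax(logits, dim=-1)         # probabilities over specific topics

    # Usage with random token ids standing in for a tokenized topic phrase
    model = TopicClassifier()
    dummy_batch = torch.randint(1, 10000, (4, 30))       # 4 phrases, 30 tokens each
    specific_topic = model(dummy_batch).argmax(dim=-1)   # predicted specific-topic index

In practice the explicit softmax would usually be folded into a cross-entropy loss during training; it is kept in the forward pass here only to mirror the softmax regression step the abstract refers to.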


Notes

  1. https://aws.amazon.com/transcribe/

  2. http://movienet.site/

  3. http://www.cs.cmu.edu/~ark/personas/

References

  1. Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) Vqa: Visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425–2433

  2. Atkinson J, Gonzalez A, Munoz M, Astudillo H (2014) Web metadata extraction and semantic indexing for learning objects extraction. Appl Intell 41(2):649–664


  3. Ballan L, Bertini M, Del Bimbo A, Seidenari L, Serra G (2011) Event detection and recognition for semantic annotation of video. Multimed Tools Appl 51(1):279–302


  4. Basu S, Yu Y, Singh VK, Zimmermann R (2016) Videopedia: Lecture video recommendation for educational blogs using topic modeling. Springer, Cham, pp 238–250


  5. Bellegarda JR (1997) A latent semantic analysis framework for large-span language modeling. In: EUROSPEECH

  6. Ben-Ahmed O, Huet B (2018) Deep multimodal features for movie genre and interestingness prediction. In: 2018 International conference on content-based multimedia indexing (CBMI). IEEE, pp 1–6

  7. Bougiatiotis K, Giannakopoulos T (2016) Content representation and similarity of movies based on topic extraction from subtitles. In: Proceedings of the 9th Hellenic conference on artificial intelligence. ACM, pp 1–7

  8. Chang X, Yang Y, Hauptmann A, Xing EP, Yu YL (2015) Semantic concept discovery for large-scale zero-shot event detection. In: Twenty-fourth international joint conference on artificial intelligence

  9. Chen D, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 190–200

  10. Chen X, Zou D, Cheng G, Xie H (2020) Detecting latent topics and trends in educational technologies over four decades using structural topic modeling: a retrospective of all volumes of Computers & Education. Comput Educ 151:103855


  11. Dascalu M, Dessus P, Trausan-Matu S, Bianco M, Nardy A (2013) ReaderBench, an environment for analyzing text complexity and reading strategies. In: Artificial intelligence in education. Springer, pp 379–388

  12. Denkowski M, Lavie A (2014) Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation, pp 376–380

  13. Fang Z, Liu J, Li Y, Qiao Y, Lu H (2019) Improving visual question answering using dropout and enhanced question encoder. Pattern Recogn 90:404–414


  14. Fourati M, Jedidi A, Gargouri F (2017) Generic descriptions for movie document: an experimental study. In: 2017 IEEE/ACS 14th international conference on computer systems and applications (AICCSA). IEEE, pp 766–773

  15. Fourati M, Jedidi A, Gargouri F (2020) A survey on description and modeling of audiovisual documents. Multimed Tools Appl 79(45):33,519–33,546


  16. Fourati M, Jedidi A, Hassin HB, Gargouri F (2015) Towards fusion of textual and visual modalities for describing audiovisual documents. Int J Multimed Data Eng Manag (IJMDEM) 6(2):52–70


  17. Gan Z, Gan C, He X, Pu Y, Tran K, Gao J, Carin L, Deng L (2017) Semantic compositional networks for visual captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5630–5639

  18. Gharbi H, Bahroun S, Zagrouba E (2019) Key frame extraction for video summarization using local description and repeatability graph clustering. SIViP 13(3):507–515


  19. Hamroun M, Tamine K, Crespin B (2021) Multimodal video indexing (mvi): A new method based on machine learning and semi-automatic annotation on large video collections. Int J Image Graph, p 2250022

  20. Hao X, Zhou F, Li X (2020) Scene-edge gru for video caption. In: 2020 IEEE 4th information technology, networking, electronic and automation control conference (ITNEC). IEEE, vol 1, pp 1290–1295

  21. Harispe S, Sanchez D, Ranwez S, Janaqi S, Montmain J (2014) A framework for unifying ontology-based semantic similarity measures: a study in the biomedical domain. J Biomed Inform 48:38–53


  22. He Y, Li Y, Lei J, Leung C (2016) A framework of query expansion for image retrieval based on knowledge base and concept similarity. Neurocomputing, in press

  23. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780


  24. Huang Q, Xiong Y, Rao A, Wang J, Lin D (2020) Movienet: a holistic dataset for movie understanding. In: Computer vision – ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV. Springer, pp 709–727

  25. Jelodar H, Wang Y, Yuan C, Feng X, Jiang X, Li Y, Zhao L (2019) Latent dirichlet allocation (lda) and topic modeling: models, applications, a survey. Multimed Tools Appl 78(11):15,169–15,211


  26. Li L, Tang S, Zhang Y, Deng L, Tian Q (2017) Gla: Global–local attention for image description. IEEE Trans Multimedia 20(3):726–737


  27. Li X, Zhang J, Ouyang J (2019) Dirichlet multinomial mixture with variational manifold regularization: Topic modeling over short texts. In: Proceedings of the AAAI Conference on artificial intelligence, vol 33, pp 7884–7891

  28. Lin CY (2004) Rouge: a package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81

  29. Luo B, Li H, Meng F, Wu Q, Huang C (2017) Video object segmentation via global consistency aware query strategy. IEEE Trans Multimed 19(7):1482–1493


  30. Matthews P (2019) Human-in-the-loop topic modelling: Assessing topic labelling and genre-topic relations with a movie plot summary corpus. In: The human position in an artificial world: creativity, ethics and AI in knowledge organization. Ergon-verlag, pp 181–207

  31. Matthews P, Glitre K (2021) Genre analysis of movies using a topic model of plot summaries. J Assoc Inf Sci 72:1–17


  32. Mocanu B, Tapu R, Tapu E (2016) Video retrieval using relevant topics extraction from movie subtitles. In: 2016 12th IEEE international symposium on electronics and telecommunications (ISETC). IEEE, pp 327–330

  33. Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318

  34. Roberts ME, Stewart BM, Tingley D (2019) Stm: an R package for structural topic models. J Stat Softw 91(1):1–40


  35. Rohrbach A, Rohrbach M, Tandon N, Schiele B (2015) A dataset for movie description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3202–3212

  36. Rotman D, Porat D, Ashour G (2016) Robust and efficient video scene detection using optimal sequential grouping. In: 2016 IEEE International symposium on multimedia (ISM). IEEE, pp 275–280

  37. Rotman D, Porat D, Ashour G (2017) Robust video scene detection using multimodal fusion of optimally grouped features. In: 2017 IEEE 19th international workshop on multimedia signal processing (MMSP). IEEE, pp 1–6

  38. Sadique MF, Rahman MA, Haque SR (2020) Content based unsupervised video summarization using birds foraging search. In: 2020 11th international conference on computing, communication and networking technologies (ICCCNT). IEEE, pp 1–7

  39. Sanchez-Nielsen E, Chavez-Gutierrez F, Lorenzo-Navarro J (2019) A semantic parliamentary multimedia approach for retrieval of video clips with content understanding. Multimedia Systems 25:337–354


  40. Shah R, Zimmermann R (2017) Multimodal analysis of user-generated multimedia content. Springer

  41. Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2018) From deterministic to generative: Multimodal stochastic rnns for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058


  42. Stappen L, Baird A, Cambria E, Schuller BW (2021) Sentiment analysis and topic recognition in video transcriptions. IEEE Intell Syst 36(2):88–95


  43. Tang P, Wang C, Wang X, Liu W, Zeng W, Wang J (2019) Object detection in videos by high quality object linking. IEEE Trans Pattern Anal Mach Intell 42(5):1272–1278


  44. Torabi A, Pal C, Larochelle H, Courville A (2015) Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070, pp 1–7

  45. Trojahn TH, Goularte R (2021) Temporal video scene segmentation using deep-learning. Multimed Tools Appl 80(12):17,487–17,513


  46. Tsai WL (2021) A cooperative mechanism for managing multimedia project documentation. Multimed Tools Appl, pp 1–14

  47. Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575

  48. Wang H, Gao C, Han Y (2020) Sequence in sequence for video captioning. Pattern Recogn Lett 130:327–334


  49. Xu J, Mei T, Yao T, Rui Y (2016) Msr-vtt: a large video description dataset for bridging video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5288–5296

  50. Yang H, Meinel C (2014) Content based lecture video retrieval using speech and video text information. IEEE Trans Learn Technol 7(2):142–154


  51. Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, Shen HT, Ji Y (2018) Video captioning by adversarial lstm. IEEE Trans Image Process 27(11):5600–5611


  52. Ye G, Li Y, Xu H, Liu D, Chang SF (2015) Eventnet: a large scale structured concept library for complex event detection in video. In: Proceedings of the 23rd ACM international conference on Multimedia. ACM, pp 471–480

  53. Zhao B, Li X, Lu X (2019) Cam-rnn: Co-attention model based rnn for video captioning. IEEE Trans Image Process 28(11):5552–5565


  54. Zhou W, Li H, Tian Q (2017) Recent advance in content-based image retrieval: A literature survey. arXiv:1706.06064


Author information

Corresponding author

Correspondence to Manel Fourati.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Fourati, M., Jedidi, A. & Gargouri, F. A deep learning-based classification for topic detection of audiovisual documents. Appl Intell 53, 8776–8798 (2023). https://doi.org/10.1007/s10489-022-03938-x
