Abstract
The description of audiovisual documents aims essentially at providing meaningful, explanatory information about their content. Despite the many efforts made by researchers to extract such descriptions, pertinent semantic descriptions are still lacking. In this paper, we introduce a new approach to improve the semantic description of cinematic audiovisual documents. To ensure a high description level, we combine different sources of information related to the content: the movie script and the text superimposed on the image. This process is mainly based on a semantic segmentation algorithm. The Structural Topic Model (STM) and the LSCOM ontology (Large Scale Concept Ontology for Multimedia, http://www.ee.columbia.edu/ln/dvmm/lscom/) are adapted for knowledge and description extraction. Deep classification techniques, such as LSTM (long short-term memory) networks and softmax regression, are used to classify generic topics into specific topics. The performance of the developed approach is assessed as follows. First, the STM is adapted and evaluated using the CMU Movie Summary corpus. Then, the topic detection and classification processes are applied, and their results are compared to human judgments on the MovieLens dataset. Finally, a quantitative evaluation is performed on the M-VAD (Montreal Video Annotation Dataset) [44] and MPII-MD (MPII Movie Description) [35] large-scale movie description datasets. The comparative study shows that the suggested approach outperforms existing ones in terms of the precision of the obtained topics.
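The abstract's final classification stage maps generic topics to specific ones with LSTM networks and softmax regression. As a minimal illustrative sketch of the softmax-regression part only (not the authors' implementation; the feature vectors and labels below are hypothetical toy data standing in for topic proportions), a multinomial logistic classifier can be trained by batch gradient descent as follows:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(X, y, n_classes, lr=0.5, epochs=200):
    """Multinomial logistic regression fitted by batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot encode the specific-topic labels
    for _ in range(epochs):
        P = softmax(X @ W + b)          # class probabilities
        W -= lr * (X.T @ (P - Y)) / n   # gradient of cross-entropy w.r.t. W
        b -= lr * (P - Y).mean(axis=0)  # gradient w.r.t. bias
    return W, b

def predict(X, W, b):
    return softmax(X @ W + b).argmax(axis=1)

# Hypothetical 2-D "topic proportion" features for two specific topics
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1.0, 0.0], 0.2, (20, 2)),
               rng.normal([0.0, 1.0], 0.2, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

W, b = train_softmax_regression(X, y, n_classes=2)
acc = (predict(X, W, b) == y).mean()
```

In the paper's pipeline the inputs would instead be sequence representations produced by the LSTM; the softmax layer then assigns each generic topic to its most probable specific topic.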
References
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) Vqa: Visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425–2433
Atkinson J, Gonzalez A, Munoz M, Astudillo H (2014) Web metadata extraction and semantic indexing for learning objects extraction. Appl Intell 41(2):649–664
Ballan L, Bertini M, Del Bimbo A, Seidenari L, Serra G (2011) Event detection and recognition for semantic annotation of video. Multimed Tools Appl 51(1):279–302
Basu S, Yu Y, Singh VK, Zimmermann R (2016) Videopedia: Lecture video recommendation for educational blogs using topic modeling. Springer, Cham, pp 238–250
Bellegarda JR (1997) A latent semantic analysis framework for large-span language modeling. In: EUROSPEECH
Ben-Ahmed O, Huet B (2018) Deep multimodal features for movie genre and interestingness prediction. In: 2018 International conference on content-based multimedia indexing (CBMI). IEEE, pp 1–6
Bougiatiotis K, Giannakopoulos T (2016) Content representation and similarity of movies based on topic extraction from subtitles. In: Proceedings of the 9th Hellenic conference on artificial intelligence. ACM, pp 1–7
Chang X, Yang Y, Hauptmann A, Xing EP, Yu YL (2015) Semantic concept discovery for large-scale zero-shot event detection. In: Twenty-fourth international joint conference on artificial intelligence
Chen D, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp 190–200
Chen X, Zou D, Cheng G, Xie H (2020) Detecting latent topics and trends in educational technologies over four decades using structural topic modeling: a retrospective of all volumes of computers & education. Comput Educ 151:103855
Dascalu M, Dessus P, Trausan-matu S, Bianco M, Nardy A (2013) Readerbench, an environment for analyzing text complexity and reading strategies. In: Artif Intell Educ. Springer, pp 379–388
Denkowski M, Lavie A (2014) Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation, pp 376–380
Fang Z, Liu J, Li Y, Qiao Y, Lu H (2019) Improving visual question answering using dropout and enhanced question encoder. Pattern Recogn 90:404–414
Fourati M, Jedidi A, Gargouri F (2017) Generic descriptions for movie document: an experimental study. In: 2017 IEEE/ACS 14Th international conference on computer systems and applications (AICCSA). IEEE, pp 766–773
Fourati M, Jedidi A, Gargouri F (2020) A survey on description and modeling of audiovisual documents. Multimed Tools Appl 79(45):33519–33546
Fourati M, Jedidi A, Hassin HB, Gargouri F (2015) Towards fusion of textual and visual modalities for describing audiovisual documents. Inter J Multimed Data Eng Manag (IJMDEM) 6(2):52–70
Gan Z, Gan C, He X, Pu Y, Tran K, Gao J, Carin L, Deng L (2017) Semantic compositional networks for visual captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5630–5639
Gharbi H, Bahroun S, Zagrouba E (2019) Key frame extraction for video summarization using local description and repeatability graph clustering. SIViP 13(3):507–515
Hamroun M, Tamine K, Crespin B (2021) Multimodal video indexing (MVI): a new method based on machine learning and semi-automatic annotation on large video collections. Int J Image Graph: 2250022
Hao X, Zhou F, Li X (2020) Scene-edge gru for video caption. In: 2020 IEEE 4Th information technology, networking, electronic and automation control conference (ITNEC). IEEE, vol 1, pp 1290–1295
Harispe S, Sánchez D, Ranwez S, Janaqi S, Montmain J (2014) A framework for unifying ontology-based semantic similarity measures: a study in the biomedical domain. J Biomed Inform 48:38–53
He Y, Li Y, Lei J, Leung C (2016) A framework of query expansion for image retrieval based on knowledge base and concept similarity. Neurocomputing (in press)
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Huang Q, Xiong Y, Rao A, Wang J, Lin D (2020) MovieNet: a holistic dataset for movie understanding. In: Computer vision – ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part IV. Springer, pp 709–727
Jelodar H, Wang Y, Yuan C, Feng X, Jiang X, Li Y, Zhao L (2019) Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimed Tools Appl 78(11):15169–15211
Li L, Tang S, Zhang Y, Deng L, Tian Q (2017) Gla: Global–local attention for image description. IEEE Trans Multimedia 20(3):726–737
Li X, Zhang J, Ouyang J (2019) Dirichlet multinomial mixture with variational manifold regularization: Topic modeling over short texts. In: Proceedings of the AAAI Conference on artificial intelligence, vol 33, pp 7884–7891
Lin CY (2004) Rouge: a package for automatic evaluation of summaries. In: Text summarization branches out, pp 74–81
Luo B, Li H, Meng F, Wu Q, Huang C (2017) Video object segmentation via global consistency aware query strategy. IEEE Trans Multimed 19(7):1482–1493
Matthews P (2019) Human-in-the-loop topic modelling: Assessing topic labelling and genre-topic relations with a movie plot summary corpus. In: The human position in an artificial world: creativity, ethics and AI in knowledge organization. Ergon-verlag, pp 181–207
Matthews P, Glitre K (2021) Genre analysis of movies using a topic model of plot summaries. J Assoc Inf Sci 72:1–17
Mocanu B, Tapu R, Tapu E (2016) Video retrieval using relevant topics extraction from movie subtitles. In: 12Th IEEE international symposium on electronics and telecommunications (ISETC), 2016. IEEE, pp 327–330
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp 311–318
Roberts ME, Stewart BM, Tingley D (2019) Stm: an r package for structural topic models. J Stat Softw 91(1):1–40
Rohrbach A, Rohrbach M, Tandon N, Schiele B (2015) A dataset for movie description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3202–3212
Rotman D, Porat D, Ashour G (2016) Robust and efficient video scene detection using optimal sequential grouping. In: 2016 IEEE International symposium on multimedia (ISM). IEEE, pp 275–280
Rotman D, Porat D, Ashour G (2017) Robust video scene detection using multimodal fusion of optimally grouped features. In: 2017 IEEE 19Th international workshop on multimedia signal processing (MMSP). IEEE, pp 1–6
Sadique MF, Rahman MA, Haque SR (2020) Content based unsupervised video summarization using birds foraging search. In: 2020 11Th international conference on computing, communication and networking technologies (ICCCNT). IEEE, pp 1–7
Sanchez-Nielsen E, Chavez-Gutierrez F, Lorenzo-Navarro J (2019) A semantic parliamentary multimedia approach for retrieval of video clips with content understanding. Multimedia Systems 25:337–354
Shah R, Zimmermann R (2017) Multimodal analysis of user-generated multimedia content. Springer
Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2018) From deterministic to generative: Multimodal stochastic rnns for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058
Stappen L, Baird A, Cambria E, Schuller BW (2021) Sentiment analysis and topic recognition in video transcriptions. IEEE Intell Syst 36(2):88–95
Tang P, Wang C, Wang X, Liu W, Zeng W, Wang J (2019) Object detection in videos by high quality object linking. IEEE Trans Pattern Anal Mach Intell 42(5):1272–1278
Torabi A, Pal C, Larochelle H, Courville A (2015) Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070
Trojahn TH, Goularte R (2021) Temporal video scene segmentation using deep-learning. Multimed Tools Appl 80(12):17487–17513
Tsai WL (2021) A cooperative mechanism for managing multimedia project documentation. Multimed Tools Appl, pp 1–14
Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: Consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575
Wang H, Gao C, Han Y (2020) Sequence in sequence for video captioning. Pattern Recogn Lett 130:327–334
Xu J, Mei T, Yao T, Rui Y (2016) Msr-vtt: a large video description dataset for bridging video and language. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5288–5296
Yang H, Meinel C (2014) Content based lecture video retrieval using speech and video text information. IEEE Trans Learn Technol 7(2):142–154
Yang Y, Zhou J, Ai J, Bin Y, Hanjalic A, Shen HT, Ji Y (2018) Video captioning by adversarial lstm. IEEE Trans Image Process 27(11):5600–5611
Ye G, Li Y, Xu H, Liu D, Chang SF (2015) Eventnet: a large scale structured concept library for complex event detection in video. In: Proceedings of the 23rd ACM international conference on Multimedia. ACM, pp 471–480
Zhao B, Li X, Lu X (2019) Cam-rnn: Co-attention model based rnn for video captioning. IEEE Trans Image Process 28(11):5552–5565
Zhou W, Li H, Tian Q (2017) Recent advance in content-based image retrieval: A literature survey. arXiv:1706.06064
Cite this article
Fourati, M., Jedidi, A. & Gargouri, F. A deep learning-based classification for topic detection of audiovisual documents. Appl Intell 53, 8776–8798 (2023). https://doi.org/10.1007/s10489-022-03938-x