Abstract
Visual concept detection is one of the most active research areas in multimedia analysis. Its goal is to assign to each elementary temporal segment of a video a confidence score for each target concept (e.g., forest, ocean, sky). Establishing such associations between video content and concept labels is a key step toward semantics-based indexing, retrieval, and summarization of videos, as well as deeper analysis such as video event detection. Owing to its significance for the multimedia analysis community, concept detection is the topic of international benchmarking activities such as TRECVID. Although video is typically a multi-modal signal composed of visual content, speech, audio, and possibly subtitles, most research has so far focused on exploiting the visual modality. In this chapter, we introduce fusion and text analysis techniques for harnessing automatic speech recognition (ASR) transcripts or subtitles to improve the results of visual concept detection. Since the emphasis is on late fusion, the introduced text-handling and fusion algorithms can be used in conjunction with standard algorithms for visual concept detection. We test our techniques on the dataset of the TRECVID 2012 Semantic Indexing (SIN) task, which comprises more than 800 hours of heterogeneous videos collected from Internet archives.
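The late-fusion setting described above can be illustrated with a minimal sketch: each modality-specific classifier produces a per-concept confidence score for a video segment, and the fused score is a weighted combination. The scores, concepts, and the weight `w` below are illustrative assumptions, not values from the chapter.

```python
import numpy as np

# Hypothetical per-segment confidence scores for three target concepts
# (forest, ocean, sky): one vector from a visual classifier and one
# from a text classifier run on the ASR transcript.
visual_scores = np.array([0.82, 0.10, 0.55])
text_scores = np.array([0.60, 0.05, 0.90])

def late_fusion(visual, text, w=0.7):
    """Weighted linear late fusion of modality-level confidence scores.

    The weight w (an assumed value, not a tuned one) controls how much
    the visual modality contributes to the fused score.
    """
    return w * visual + (1.0 - w) * text

fused = late_fusion(visual_scores, text_scores)
print(fused)  # per-concept fused confidence scores
```

Because fusion happens at the score level, the visual classifier can be any standard concept detector; only its output scores are consumed.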
Notes
1. Hence the name Explicit Semantic Analysis: because the model uses natural concepts (Wikipedia articles), it is easy to explain to human users.
2. The ESAlib implementation was obtained from http://ticcky.github.io/esalib/, with the ESA background built from a 2005 Wikipedia snapshot.
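The core idea behind ESA can be sketched briefly: a text is mapped to a weighted vector over Wikipedia concepts, and relatedness between two texts is the cosine of their concept vectors. The toy term-to-concept index below is a made-up stand-in for the real ESAlib background, used only to show the mechanics.

```python
import math
from collections import defaultdict

# Toy ESA-style index (hypothetical weights, not the ESAlib background):
# each term maps to a tf-idf-weighted vector over Wikipedia concepts.
concept_index = {
    "forest": {"Forest": 0.9, "Tree": 0.4},
    "ocean": {"Ocean": 0.8, "Sea": 0.5},
    "tree": {"Tree": 0.7, "Forest": 0.3},
}

def esa_vector(text):
    """Represent a text as the sum of its terms' concept vectors."""
    vec = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in concept_index.get(term, {}).items():
            vec[concept] += weight
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[c] * v.get(c, 0.0) for c in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(esa_vector("forest tree"), esa_vector("ocean")))  # no shared concepts
```

Because the dimensions are Wikipedia articles rather than latent factors, the resulting relatedness scores are directly interpretable.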
Acknowledgments
This work was supported by the European Commission under contract FP7-287911 LinkedTV.
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this chapter
Galanopoulos, D., Dojchinovski, M., Chandramouli, K., Kliegr, T., Mezaris, V. (2015). Multimodal Fusion: Combining Visual and Textual Cues for Concept Detection in Video. In: Baughman, A., Gao, J., Pan, JY., Petrushin, V. (eds) Multimedia Data Mining and Analytics. Springer, Cham. https://doi.org/10.1007/978-3-319-14998-1_13
Print ISBN: 978-3-319-14997-4
Online ISBN: 978-3-319-14998-1