
CLOVIS: towards precision-oriented text-based video retrieval through the unification of automatically-extracted concepts and relations of the visual and audio/speech contents

Journal of Intelligent Information Systems

Abstract

Traditional multimedia (video) retrieval systems rely on the keyword-based approach to keep the search process fast, although this approach has several shortcomings related to the way users are able to formulate their information needs. Typical Web multimedia retrieval systems illustrate this paradigm: a search returns a collection of thousands of multimedia documents, many of which are irrelevant or not fully exploited by the typical user. Indeed, according to studies of user behavior, an individual is mostly interested in the first documents returned during a search session. A multimedia retrieval system must therefore model the multimedia content as precisely as possible, so that the first retrieved results are fully relevant to the user's information need. For this purpose, the keyword-based approach is clearly insufficient. A high-level index and query language that addresses the combination of modalities within expressive frameworks for video indexing and retrieval is of great importance, and is the only solution for achieving significant retrieval performance. This paper presents a multi-faceted conceptual framework integrating multiple characterizations of the visual and audio contents for automatic video retrieval. It relies on an expressive representation formalism that handles high-level video descriptions, together with a full-text query framework, in an attempt to take video indexing and retrieval beyond trivial low-level processes, keyword-annotation frameworks and state-of-the-art architectures that loosely couple visual and audio descriptions. Experiments on the multimedia topic search task of the TRECVID evaluation campaign validate our proposal.



Author information

Correspondence to M. Belkhatir.

About this article

Cite this article

Belkhatir, M. CLOVIS: towards precision-oriented text-based video retrieval through the unification of automatically-extracted concepts and relations of the visual and audio/speech contents. J Intell Inf Syst 34, 135–175 (2010). https://doi.org/10.1007/s10844-009-0083-x

  • DOI: https://doi.org/10.1007/s10844-009-0083-x
