SuMACC Project’s Corpus

A Topic-Based Query Extension Approach to Retrieve Multimedia Documents

  • Conference paper
Text, Speech and Dialogue (TSD 2014)

Abstract

The SuMACC project aims to automatically track new multimodal entities on the Internet. The goal of the project is to propose robust multimedia methods that define relevant patterns, allowing these entities to be retrieved automatically. This paper describes the SuMACC corpus, collected on video-sharing platforms using word queries. Since concepts are limited to a single word or a few words, querying video-sharing platforms with the concept alone can easily return irrelevant videos. In this paper, we propose to use an extended query obtained by mapping the initial concept into a topic space estimated with a Latent Dirichlet Allocation (LDA) algorithm. This topic-based query extension approach retrieves videos related to the targeted concept more reliably. As a result, a corpus of 7,517 videos, extracted using both the simple (i.e., concept-only) and the extended queries for 47 concepts, was obtained. Results show the effectiveness of the proposed thematic querying approach compared to the simple concept query in terms of relevance (+21%) and ambiguity (−4%). The annotation process and the corpus statistics are also detailed in this paper.
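
The idea of mapping a concept into a topic space and extending the query can be illustrated with a minimal sketch. The abstract does not say how the topic is selected or how many terms are added, so everything below is an assumption: the toy `TOPIC_WORD` and `TOPIC_PRIOR` tables are hypothetical stand-ins for a trained LDA model, and the function names `topic_posterior` and `extend_query` are invented for illustration only.

```python
from typing import Dict, List

# Hypothetical stand-in for a trained LDA model: P(word | topic) for two toy topics.
# In practice these distributions would be estimated from a large text corpus.
TOPIC_WORD: Dict[str, Dict[str, float]] = {
    "fruit": {"apple": 0.30, "juice": 0.25, "orchard": 0.24, "pie": 0.21},
    "tech": {"apple": 0.20, "iphone": 0.35, "mac": 0.30, "keynote": 0.15},
}
# Hypothetical topic prior P(topic); uniform here for simplicity.
TOPIC_PRIOR: Dict[str, float] = {"fruit": 0.5, "tech": 0.5}


def topic_posterior(concept: str) -> Dict[str, float]:
    """Map a concept into the topic space: P(topic | concept) via Bayes' rule."""
    joint = {
        topic: TOPIC_PRIOR[topic] * words.get(concept, 0.0)
        for topic, words in TOPIC_WORD.items()
    }
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError(f"concept {concept!r} is not in the topic vocabulary")
    return {topic: p / total for topic, p in joint.items()}


def extend_query(concept: str, n_terms: int = 2) -> List[str]:
    """Return the concept followed by the top words of its most probable topic."""
    posterior = topic_posterior(concept)
    best_topic = max(posterior, key=posterior.get)
    # Rank the words of the selected topic by decreasing probability.
    ranked = sorted(TOPIC_WORD[best_topic].items(), key=lambda kv: kv[1], reverse=True)
    extra = [word for word, _ in ranked if word != concept][:n_terms]
    return [concept] + extra


if __name__ == "__main__":
    print(topic_posterior("apple"))
    print(extend_query("apple"))
    print(extend_query("iphone"))
```

In this toy setting the ambiguous concept "apple" is extended with words from whichever topic it is most probable under, which is one simple way to narrow a one-word query. Other choices, such as mixing words from several topics weighted by the posterior, would be equally plausible readings of the abstract.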

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Morchid, M. et al. (2014). SuMACC Project’s Corpus. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_4

  • DOI: https://doi.org/10.1007/978-3-319-10816-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10815-5

  • Online ISBN: 978-3-319-10816-2

  • eBook Packages: Computer Science, Computer Science (R0)
