Text-to-video: a semantic search engine for internet videos

  • Regular Paper
  • International Journal of Multimedia Information Retrieval

Abstract

Semantic search, or text-to-video search, is a novel and challenging problem in information and multimedia retrieval. Existing solutions are mainly limited to text-to-text matching, in which the query words are matched against user-generated metadata. This kind of text-to-text search, though simple, is of limited use because it provides no understanding of the video content. This paper presents a state-of-the-art system for event search that uses neither user-generated metadata nor example videos, known as text-to-video search. The system relies on substantial video content understanding and supports searching for complex events over large collections of videos. The proposed text-to-video search can be used to augment existing text-to-text search for video. Its novelty and practicality are demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieved the best performance. We share our observations and lessons from building such a state-of-the-art system, which may be instrumental in guiding the design of future systems for video search and analysis.
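
The abstract's contrast between metadata matching and content understanding can be made concrete with a small sketch. The following is a minimal illustration of concept-based matching only, under stated assumptions: all identifiers are hypothetical, the query and concept-name embeddings could come from a model such as word2vec [33], and the paper's actual system fuses many more modalities and retrieval models than this.

```python
import numpy as np

# Hypothetical inputs (names and shapes are illustrative assumptions):
#   video_concepts: (n_videos, n_concepts) scores from visual concept detectors
#   concept_vecs:   (n_concepts, d) embeddings of the concept names
#   query_vec:      (d,) embedding of the text query

def semantic_search(query_vec, video_concepts, concept_vecs, top_m=10):
    """Rank videos for a text query without metadata or example videos."""
    # Cosine similarity between the query and each concept's name embedding.
    q = query_vec / np.linalg.norm(query_vec)
    c = concept_vecs / np.linalg.norm(concept_vecs, axis=1, keepdims=True)
    sim = c @ q                                    # (n_concepts,)

    # Keep only the few most query-relevant concepts; the rest act as noise.
    idx = np.argsort(sim)[-top_m:]
    weights = np.clip(sim[idx], 0.0, None)

    # Score each video by the similarity-weighted sum of its concept scores.
    scores = video_concepts[:, idx] @ weights      # (n_videos,)
    return np.argsort(scores)[::-1]                # video indices, best first
```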



Notes

  1. When different retrieval models are used for different features, and their ranked lists have very different score distributions, we may need to solve a linear programming problem to determine the starting pseudo-positives [18].

  2. The reason we can randomly select pseudo-negative samples is that they have a negligible impact on performance [18]; see the sketch after these notes.

  3. The time is the search time evaluated on an Intel Xeon 2.53 GHz CPU. It does not include the PRF time, which is about 238 ms over 200K videos.
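
Notes 1–3 describe the pseudo-relevance feedback (PRF) loop; the sketch below illustrates it under simplifying assumptions. Pseudo-positives are taken straight from the top of the initial ranked list (Note 1 explains when a linear program is needed instead), pseudo-negatives are drawn at random (Note 2), and scikit-learn's LinearSVC stands in for the LIBSVM package [4]. All identifiers are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def prf_rerank(init_scores, features, n_pos=10, n_neg=100, seed=0):
    """Re-rank videos by pseudo-relevance feedback on an initial result list."""
    rng = np.random.default_rng(seed)
    order = np.argsort(init_scores)[::-1]        # initial ranking, best first

    # Pseudo-positives: the top of the initial list. When features use
    # different retrieval models with incomparable score distributions,
    # Note 1 says a linear program may be needed to pick this set instead.
    pos = order[:n_pos]

    # Pseudo-negatives: drawn at random from the rest of the collection,
    # which Note 2 reports has negligible impact on performance.
    neg = rng.choice(order[n_pos:], size=n_neg, replace=False)

    X = np.vstack([features[pos], features[neg]])
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

    clf = LinearSVC(C=1.0).fit(X, y)             # stand-in for LIBSVM [4]
    return np.argsort(clf.decision_function(features))[::-1]
```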

References

  1. Banerjee S, Pedersen T (2002) An adapted Lesk algorithm for word sense disambiguation using WordNet. In: CICLing

  2. Bengio Y, Louradour J, Collobert R, Weston J (2009) Curriculum learning. In: ICML

  3. Bhattacharya S, Yu FX, Chang SF (2014) Minimally needed evidence for complex event recognition in unconstrained videos. In: ICMR

  4. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM TIST 2:27:1–27:27

  5. Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29

  6. Dalton J, Allan J, Mirajkar P (2013) Zero-shot video retrieval using content and concepts. In: CIKM

  7. Davidson J, Liebald B, Liu J et al (2010) The YouTube video recommendation system. In: RecSys

  8. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: CVPR

  9. Gkalelis N, Mezaris V (2014) Video event detection using generalized subclass discriminant analysis and linear support vector machines. In: ICMR

  10. Habibian A, Mensink T, Snoek CG (2014) Composite concept discovery for zero-shot video event detection. In: ICMR

  11. Habibian A, van de Sande KE, Snoek CG (2013) Recommendations for video event recognition using concept vocabularies. In: ICMR

  12. Hauptmann A, Yan R, Lin WH, Christel M, Wactlar H (2007) Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. TMM 9(5):958–966

  13. Inoue N, Shinoda K (2014) n-gram models for video semantic indexing. In: MM

  14. Jiang L, Hauptmann A, Xiang G (2012) Leveraging high-level and low-level features for multimedia event detection. In: MM

  15. Jiang L, Meng D, Mitamura T, Hauptmann AG (2014) Easy samples first: self-paced reranking for zero-example multimedia search. In: MM

  16. Jiang L, Meng D, Yu SI, Lan Z, Shan S, Hauptmann AG (2014) Self-paced learning with diversity. In: NIPS

  17. Jiang L, Meng D, Zhao Q, Shan S, Hauptmann AG (2015) Self-paced curriculum learning. In: AAAI

  18. Jiang L, Mitamura T, Yu SI, Hauptmann AG (2014) Zero-example event search using multimodal pseudo relevance feedback. In: ICMR

  19. Jiang L, Tong W, Meng D, Hauptmann AG (2014) Towards efficient learning of optimal spatial bag-of-words representations. In: ICMR

  20. Jiang L, Yu SI, Meng D, Mitamura T, Hauptmann AG (2015) Bridging the ultimate semantic gap: a semantic search engine for internet videos. In: ICMR

  21. Jiang L, Yu SI, Meng D, Yang Y, Mitamura T, Hauptmann AG (2015) Fast and accurate content-based semantic search in 100M internet videos. In: MM

  22. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: CVPR

  23. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: NIPS

  24. Kumar M, Packer B, Koller D (2010) Self-paced learning for latent variable models. In: NIPS

  25. Lee H (2014) Analyzing complex events and human actions in "in-the-wild" videos. In: UMD Ph.D. Theses and Dissertations

  26. Levy O, Goldberg Y (2014) Dependency-based word embeddings. In: ACL

  27. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval, vol 1. Cambridge University Press, Cambridge

  28. Mazloom M, Li X, Snoek CG (2014) Few-example video event retrieval using tag propagation. In: ICMR

  29. Miao Y, Gowayyed M, Metze F (2015) EESEN: end-to-end speech recognition using deep RNN models and WFST-based decoding. arXiv:1507.08240

  30. Miao Y, Jiang L, Zhang H, Metze F (2014) Improvements to speaker adaptive training of deep neural networks. In: SLT

  31. Miao Y, Metze F (2013) Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training. In: INTERSPEECH

  32. Miao Y, Metze F, Rawat S (2013) Deep maxout networks for low-resource speech recognition. In: ASRU

  33. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS

  34. Norouzi M, Mikolov T, Bengio S, Singer Y, Shlens J, Frome A, Corrado GS, Dean J (2014) Zero-shot learning by convex combination of semantic embeddings. In: ICLR

  35. Oh S, McCloskey S, Kim I, Vahdat A, Cannons KJ, Hajimirsadeghi H, Mori G, Perera AA, Pandey M, Corso JJ (2014) Multimedia event detection with multimodal feature fusion and temporal concept localization. Mach Vision Appl 25(1):49–69

  36. Over P, Awad G, Michel M, Fiscus J, Sanders G, Kraaij W, Smeaton AF, Quénot G (2014) TRECVID 2014—an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: TRECVID

  37. Palatucci M, Pomerleau D, Hinton GE, Mitchell TM (2009) Zero-shot learning with semantic output codes. In: NIPS

  38. Povey D, Ghoshal A, Boulianne G et al (2011) The Kaldi speech recognition toolkit. In: ASRU

  39. Safadi B, Sahuguet M, Huet B (2014) When textual and visual information join forces for multimedia retrieval. In: ICMR

  40. Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li LJ (2015) The new data and new challenges in multimedia research. arXiv:1503.01817

  41. Tong W, Yang Y, Jiang L et al (2014) E-LAMP: integration of innovative ideas for multimedia event detection. Mach Vision Appl 25(1):5–15

  42. Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature maps. PAMI 34(3):480–492

  43. Wang F, Sun Z, Jiang Y, Ngo C (2013) Video event detection using motion relativity and feature selection. TMM

  44. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: ICCV

  45. Wu S, Bondugula S, Luisier F, Zhuang X, Natarajan P (2014) Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In: CVPR

  46. Wu Z, Palmer M (1994) Verb semantics and lexical selection. In: ACL

  47. Younessian E, Mitamura T, Hauptmann A (2012) Multimodal knowledge-based analysis in multimedia event detection. In: ICMR

  48. Yu SI, Jiang L, Hauptmann A (2014) Instructional videos for unsupervised harvesting and learning of action examples. In: MM

  49. Yu SI, Jiang L, Xu Z, Yang Y, Hauptmann AG (2015) Content-based video search over 1 million videos with 1 core in 1 second. In: ICMR

  50. Yu SI, Jiang L, Xu Z et al (2014) CMU-Informedia@TRECVID 2014. In: TRECVID

  51. Zhai C, Lafferty J (2004) A study of smoothing methods for language models applied to information retrieval. TOIS 22(2)

  52. Zhao Q, Meng D, Jiang L, Xie Q, Xu Z, Hauptmann AG (2015) Self-paced learning for matrix factorization. In: AAAI

Acknowledgments

This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. Deyu Meng was partially supported by the China NSFC project under contract 61373114. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation grant number OCI-1053575. It used the Blacklight system at the Pittsburgh Supercomputing Center (PSC).

Author information

Corresponding author

Correspondence to Lu Jiang.

About this article

Cite this article

Jiang, L., Yu, SI., Meng, D. et al. Text-to-video: a semantic search engine for internet videos. Int J Multimed Info Retr 5, 3–18 (2016). https://doi.org/10.1007/s13735-015-0093-0

