Abstract
Semantic search, or text-to-video search, is a novel and challenging problem in information and multimedia retrieval. Existing solutions are mainly limited to text-to-text matching, in which query words are matched against user-generated metadata. This kind of text-to-text search, though simple, offers limited functionality because it provides no understanding of the video content. This paper presents a state-of-the-art system for event search that requires neither user-generated metadata nor example videos, known as text-to-video search. The system relies on substantial video content understanding and supports searching for complex events over a large collection of videos. The proposed text-to-video search can be used to augment existing text-to-text search for video. Its novelty and practicality are demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieved the best performance. We share our observations and lessons from building such a state-of-the-art system, which may help guide the design of future systems for video search and analysis.
Notes
When different retrieval models are used for different features, and their ranked lists have very different score distributions, we may need to solve a linear programming problem to determine the starting pseudo-positives [18].
The reason we can randomly select pseudo-negative samples is that they have negligible impact on performance [18].
The reported time is the search time measured on an Intel Xeon 2.53 GHz CPU. It does not include the PRF time, which is about 238 ms over 200K videos.
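The pseudo-relevance feedback described in the notes above can be sketched as follows: pseudo-positives are taken from the top of the initial ranked list, while pseudo-negatives are drawn at random from the remainder. This is a minimal illustration only; the function and parameter names are hypothetical, and the full procedure (including the linear-programming variant for mismatched score distributions) is given in [18].

```python
import random

def select_prf_samples(ranked_video_ids, num_pos=10, num_neg=50, seed=0):
    """Sketch of PRF sample selection (hypothetical names, see [18]).

    ranked_video_ids: video IDs sorted by the initial retrieval score,
    best first. Returns (pseudo_positives, pseudo_negatives).
    """
    # Pseudo-positives: the top-ranked videos from the initial search.
    pseudo_pos = ranked_video_ids[:num_pos]
    # Pseudo-negatives: sampled at random from the rest of the list,
    # which is safe because random negatives have negligible impact
    # on performance.
    rng = random.Random(seed)
    pseudo_neg = rng.sample(ranked_video_ids[num_pos:], num_neg)
    return pseudo_pos, pseudo_neg
```

A reranking model would then be trained on these pseudo-labels and used to rescore the list, as in the self-paced reranking of [23].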
References
Banerjee S, Pedersen T (2002) An adapted lesk algorithm for word sense disambiguation using wordnet. In: CICLing
Bengio Y, Louradour J, Collobert R, Weston J (2009) Curriculum learning. In: ICML
Bhattacharya S, Yu FX, Chang SF (2014) Minimally needed evidence for complex event recognition in unconstrained videos. In: ICMR
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM TIST 2:27:1–27:27
Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29
Dalton J, Allan J, Mirajkar P (2013) Zero-shot video retrieval using content and concepts. In: CIKM
Davidson J, Liebald B, Liu J et al (2010) The youtube video recommendation system. In: RecSys
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: CVPR
Gkalelis N, Mezaris V (2014) Video event detection using generalized subclass discriminant analysis and linear support vector machines. In: ICMR
Habibian A, Mensink T, Snoek CG (2014) Composite concept discovery for zero-shot video event detection. In: ICMR
Habibian A, van de Sande KE, Snoek CG (2013) Recommendations for video event recognition using concept vocabularies. In: ICMR
Hauptmann A, Yan R, Lin WH, Christel M, Wactlar H (2007) Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. TMM 9(5):958–966
Inoue N, Shinoda K (2014) n-gram models for video semantic indexing. In: MM
Jiang L, Hauptmann A, Xiang G (2012) Leveraging high-level and low-level features for multimedia event detection. In: MM
Jiang L, Meng D, Mitamura T, Hauptmann AG (2014) Easy samples first: self-paced reranking for zero-example multimedia search. In: MM
Jiang L, Meng D, Yu SI, Lan Z, Shan S, Hauptmann AG (2014) Self-paced learning with diversity. In: NIPS
Jiang L, Meng D, Zhao Q, Shan S, Hauptmann AG (2015) Self-paced curriculum learning. In: AAAI
Jiang L, Mitamura T, Yu SI, Hauptmann AG (2014) Zero-example event search using multimodal pseudo relevance feedback. In: ICMR
Jiang L, Tong W, Meng D, Hauptmann AG (2014) Towards efficient learning of optimal spatial bag-of-words representations. In: ICMR
Jiang L, Yu SI, Meng D, Mitamura T, Hauptmann AG (2015) Bridging the ultimate semantic gap: a semantic search engine for internet videos. In: ICMR
Jiang L, Yu SI, Meng D, Yang Y, Mitamura T, Hauptmann AG (2015) Fast and accurate content-based semantic search in 100m internet videos. In: MM
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: CVPR
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: NIPS
Kumar M, Packer B, Koller D (2010) Self-paced learning for latent variable models. In: NIPS
Lee H (2014) Analyzing complex events and human actions in "in-the-wild" videos. In: UMD Ph.D Theses and Dissertations
Levy O, Goldberg Y (2014) Dependency-based word embeddings. In: ACL
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval, vol 1. Cambridge University Press, Cambridge
Mazloom M, Li X, Snoek CG (2014) Few-example video event retrieval using tag propagation. In: ICMR
Miao Y, Gowayyed M, Metze F (2015) Eesen: end-to-end speech recognition using deep rnn models and wfst-based decoding. arXiv:1507.08240
Miao Y, Jiang L, Zhang H, Metze F (2014) Improvements to speaker adaptive training of deep neural networks. In: SLT
Miao Y, Metze F (2013) Improving low-resource cd-dnn-hmm using dropout and multilingual dnn training. In: INTERSPEECH
Miao Y, Metze F, Rawat S (2013) Deep maxout networks for low-resource speech recognition. In: ASRU
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS
Norouzi M, Mikolov T, Bengio S, Singer Y, Shlens J, Frome A, Corrado GS, Dean J (2014) Zero-shot learning by convex combination of semantic embeddings. In: ICLR
Oh S, McCloskey S, Kim I, Vahdat A, Cannons KJ, Hajimirsadeghi H, Mori G, Perera AA, Pandey M, Corso JJ (2014) Multimedia event detection with multimodal feature fusion and temporal concept localization. Mach Vision Appl 25(1):49–69
Over P, Awad G, Michel M, Fiscus J, Sanders G, Kraaij W, Smeaton AF, Quenot G (2014) TRECVID 2014—an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: TRECVID
Palatucci M, Pomerleau D, Hinton GE, Mitchell TM (2009) Zero-shot learning with semantic output codes. In: NIPS
Povey D, Ghoshal A, Boulianne G et al (2011) The kaldi speech recognition toolkit. In: ASRU
Safadi B, Sahuguet M, Huet B (2014) When textual and visual information join forces for multimedia retrieval. In: ICMR
Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li LJ (2015) The new data and new challenges in multimedia research. arXiv:1503.01817
Tong W, Yang Y, Jiang L et al (2014) E-lamp: integration of innovative ideas for multimedia event detection. Mach Vision Appl 25(1):5–15
Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature maps. PAMI 34(3):480–492
Wang F, Sun Z, Jiang Y, Ngo C (2013) Video event detection using motion relativity and feature selection. TMM
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: ICCV
Wu S, Bondugula S, Luisier F, Zhuang X, Natarajan P (2014) Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In: CVPR
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: ACL
Younessian E, Mitamura T, Hauptmann A (2012) Multimodal knowledge-based analysis in multimedia event detection. In: ICMR
Yu SI, Jiang L, Hauptmann A (2014) Instructional videos for unsupervised harvesting and learning of action examples. In: MM
Yu SI, Jiang L, Xu Z, Yang Y, Hauptmann AG (2015) Content-based video search over 1 million videos with 1 core in 1 second. In: ICMR
Yu SI, Jiang L, Xu Z et al (2014) Cmu-informedia@trecvid 2014. In: TRECVID
Zhai C, Lafferty J (2004) A study of smoothing methods for language models applied to information retrieval. TOIS 22(2)
Zhao Q, Meng D, Jiang L, Xie Q, Xu Z, Hauptmann AG (2015) Self-paced learning for matrix factorization. In: AAAI
Acknowledgments
This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. Deyu Meng was partially supported by the China NSFC project under contract 61373114. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation grant number OCI-1053575. It used the Blacklight system at the Pittsburgh Supercomputing Center (PSC).
Cite this article
Jiang, L., Yu, SI., Meng, D. et al. Text-to-video: a semantic search engine for internet videos. Int J Multimed Info Retr 5, 3–18 (2016). https://doi.org/10.1007/s13735-015-0093-0