Abstract
Semantic search, or text-to-video search, is a novel and challenging problem in information and multimedia retrieval. Existing solutions are mainly limited to text-to-text matching, in which query words are matched against user-generated metadata. This kind of text-to-text search, though simple, offers limited functionality because it provides no understanding of the video content. This paper presents a state-of-the-art system for event search that requires neither user-generated metadata nor example videos, known as text-to-video search. The system relies on substantial video content understanding and supports searching for complex events over a large collection of videos. The proposed text-to-video search can be used to augment existing text-to-text search for video. Its novelty and practicality are demonstrated by the evaluation in NIST TRECVID 2014, where the proposed system achieved the best performance. We share our observations and lessons from building such a state-of-the-art system, which may help guide the design of future systems for video search and analysis.
Notes
When different retrieval models are used for different features, and their ranked lists have very different score distributions, we may need to solve a linear programming problem to determine the starting pseudo-positives [18].
The reason we can randomly select pseudo-negative samples is that they have negligible impact on performance [18].
The reported time is the search time measured on an Intel Xeon 2.53 GHz CPU. It does not include the PRF time, which is about 238 ms over 200K videos.
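The pseudo-relevance feedback described in the notes above can be sketched as follows: pseudo-positives are taken from the top of the initial ranked list, while pseudo-negatives are drawn at random from the remainder. This is a minimal illustration only; the function and parameter names are hypothetical, and the full procedure (including the linear-programming variant for mismatched score distributions) is given in [18].

```python
import random

def select_prf_samples(ranked_video_ids, num_pos=10, num_neg=50, seed=0):
    """Sketch of PRF sample selection (hypothetical names, see [18]).

    ranked_video_ids: video IDs sorted by the initial retrieval score,
    best first. Returns (pseudo_positives, pseudo_negatives).
    """
    # Pseudo-positives: the top-ranked videos from the initial search.
    pseudo_pos = ranked_video_ids[:num_pos]
    # Pseudo-negatives: sampled at random from the rest of the list,
    # which is safe because random negatives have negligible impact
    # on performance.
    rng = random.Random(seed)
    pseudo_neg = rng.sample(ranked_video_ids[num_pos:], num_neg)
    return pseudo_pos, pseudo_neg
```

A reranking model would then be trained on these pseudo-labels and used to rescore the list, as in the self-paced reranking of [23].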
References
Banerjee S, Pedersen T (2002) An adapted lesk algorithm for word sense disambiguation using wordnet. In: CICLing
Bengio Y, Louradour J, Collobert R, Weston J (2009) Curriculum learning. In: ICML
Bhattacharya S, Yu FX, Chang SF (2014) Minimally needed evidence for complex event recognition in unconstrained videos. In: ICMR
Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM TIST 2:27:1–27:27
Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29
Dalton J, Allan J, Mirajkar P (2013) Zero-shot video retrieval using content and concepts. In: CIKM
Davidson J, Liebald B, Liu J et al (2010) The youtube video recommendation system. In: RecSys
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: CVPR
Gkalelis N, Mezaris V (2014) Video event detection using generalized subclass discriminant analysis and linear support vector machines. In: ICMR
Habibian A, Mensink T, Snoek CG (2014) Composite concept discovery for zero-shot video event detection. In: ICMR
Habibian A, van de Sande KE, Snoek CG (2013) Recommendations for video event recognition using concept vocabularies. In: ICMR
Hauptmann A, Yan R, Lin WH, Christel M, Wactlar H (2007) Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. TMM 9(5):958–966
Inoue N, Shinoda K (2014) n-gram models for video semantic indexing. In: MM
Jiang L, Hauptmann A, Xiang G (2012) Leveraging high-level and low-level features for multimedia event detection. In: MM
Jiang L, Meng D, Mitamura T, Hauptmann AG (2014) Easy samples first: self-paced reranking for zero-example multimedia search. In: MM
Jiang L, Meng D, Yu SI, Lan Z, Shan S, Hauptmann AG (2014) Self-paced learning with diversity. In: NIPS
Jiang L, Meng D, Zhao Q, Shan S, Hauptmann AG (2015) Self-paced curriculum learning. In: AAAI
Jiang L, Mitamura T, Yu SI, Hauptmann AG (2014) Zero-example event search using multimodal pseudo relevance feedback. In: ICMR
Jiang L, Tong W, Meng D, Hauptmann AG (2014) Towards efficient learning of optimal spatial bag-of-words representations. In: ICMR
Jiang L, Yu SI, Meng D, Mitamura T, Hauptmann AG (2015) Bridging the ultimate semantic gap: a semantic search engine for internet videos. In: ICMR
Jiang L, Yu SI, Meng D, Yang Y, Mitamura T, Hauptmann AG (2015) Fast and accurate content-based semantic search in 100m internet videos. In: MM
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: CVPR
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: NIPS
Kumar M, Packer B, Koller D (2010) Self-paced learning for latent variable models. In: NIPS
Lee H (2014) Analyzing complex events and human actions in "in-the-wild" videos. In: UMD Ph.D Theses and Dissertations
Levy O, Goldberg Y (2014) Dependency-based word embeddings. In: ACL
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval, vol 1. Cambridge University Press, Cambridge
Mazloom M, Li X, Snoek CG (2014) Few-example video event retrieval using tag propagation. In: ICMR
Miao Y, Gowayyed M, Metze F (2015) Eesen: end-to-end speech recognition using deep rnn models and wfst-based decoding. arXiv:1507.08240
Miao Y, Jiang L, Zhang H, Metze F (2014) Improvements to speaker adaptive training of deep neural networks. In: SLT
Miao Y, Metze F (2013) Improving low-resource cd-dnn-hmm using dropout and multilingual dnn training. In: INTERSPEECH
Miao Y, Metze F, Rawat S (2013) Deep maxout networks for low-resource speech recognition. In: ASRU
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: NIPS
Norouzi M, Mikolov T, Bengio S, Singer Y, Shlens J, Frome A, Corrado GS, Dean J (2014) Zero-shot learning by convex combination of semantic embeddings. In: ICLR
Oh S, McCloskey S, Kim I, Vahdat A, Cannons KJ, Hajimirsadeghi H, Mori G, Perera AA, Pandey M, Corso JJ (2014) Multimedia event detection with multimodal feature fusion and temporal concept localization. Mach Vision Appl 25(1):49–69
Over P, Awad G, Michel M, Fiscus J, Sanders G, Kraaij W, Smeaton AF, Quenot G (2014) TRECVID 2014—an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: TRECVID
Palatucci M, Pomerleau D, Hinton GE, Mitchell TM (2009) Zero-shot learning with semantic output codes. In: NIPS
Povey D, Ghoshal A, Boulianne G et al (2011) The kaldi speech recognition toolkit. In: ASRU
Safadi B, Sahuguet M, Huet B (2014) When textual and visual information join forces for multimedia retrieval. In: ICMR
Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li LJ (2015) The new data and new challenges in multimedia research. arXiv:1503.01817
Tong W, Yang Y, Jiang L et al (2014) E-lamp: integration of innovative ideas for multimedia event detection. Mach Vision Appl 25(1):5–15
Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature maps. PAMI 34(3):480–492
Wang F, Sun Z, Jiang Y, Ngo C (2013) Video event detection using motion relativity and feature selection. TMM
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: ICCV
Wu S, Bondugula S, Luisier F, Zhuang X, Natarajan P (2014) Zero-shot event detection using multi-modal fusion of weakly supervised concepts. In: CVPR
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: ACL
Younessian E, Mitamura T, Hauptmann A (2012) Multimodal knowledge-based analysis in multimedia event detection. In: ICMR
Yu SI, Jiang L, Hauptmann A (2014) Instructional videos for unsupervised harvesting and learning of action examples. In: MM
Yu SI, Jiang L, Xu Z, Yang Y, Hauptmann AG (2015) Content-based video search over 1 million videos with 1 core in 1 second. In: ICMR
Yu SI, Jiang L, Xu Z et al (2014) Cmu-informedia@trecvid 2014. In: TRECVID
Zhai C, Lafferty J (2004) A study of smoothing methods for language models applied to information retrieval. TOIS 22(2)
Zhao Q, Meng D, Jiang L, Xie Q, Xu Z, Hauptmann AG (2015) Self-paced learning for matrix factorization. In: AAAI
Acknowledgments
This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. Deyu Meng was partially supported by the China NSFC project under contract 61373114. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.
This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by the National Science Foundation grant number OCI-1053575. It used the Blacklight system at the Pittsburgh Supercomputing Center (PSC).
Cite this article
Jiang, L., Yu, SI., Meng, D. et al. Text-to-video: a semantic search engine for internet videos. Int J Multimed Info Retr 5, 3–18 (2016). https://doi.org/10.1007/s13735-015-0093-0