Cross-Modality Video Segment Retrieval with Ensemble Learning

Yu, Xinyan; Zhang, Ya; Zhang, Rui

doi:10.1007/978-3-030-30671-7_5

Abstract

Jointly modeling vision and language is an emerging research area with many applications, such as video segment retrieval and dense video captioning. Compared with video-language retrieval, video segment retrieval is a newer task that uses a natural language query to retrieve a specific segment from within a whole video. One common approach is to learn a similarity metric between video and language features. In this chapter, we use ensemble learning to train a video segment retrieval model. Our ensemble model combines the individual single-stream models to learn a better similarity metric. We evaluate our method on the video clip retrieval task using the recently proposed Distinct Describable Moments dataset. Extensive experiments show that our approach improves on state-of-the-art results.
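The abstract does not specify how the single-stream models are combined, so the following is only a minimal sketch of one plausible reading: each stream scores every candidate segment against the query, and the ensemble ranks segments by a weighted average of those scores. The function names `ensemble_scores` and `rank_segments`, the stream names `rgb` and `flow`, the use of cosine similarity, and the fixed weights are all assumptions made for illustration, not details taken from the chapter.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ensemble_scores(
    query_by_stream: Dict[str, Sequence[float]],
    segments_by_stream: Dict[str, List[Sequence[float]]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-stream query/segment similarities.

    Assumption: every stream has already embedded the query and each
    candidate segment into its own joint space, and all streams list
    the candidate segments in the same order.
    """
    num_segments = len(next(iter(segments_by_stream.values())))
    total_weight = sum(weights.values())
    scores = [0.0] * num_segments
    for stream, weight in weights.items():
        query = query_by_stream[stream]
        for i, segment in enumerate(segments_by_stream[stream]):
            scores[i] += (weight / total_weight) * cosine_similarity(query, segment)
    return scores


def rank_segments(scores: Sequence[float]) -> List[int]:
    """Segment indices ordered from most to least similar to the query."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings for three candidate segments (hypothetical data).
    query = {"rgb": [1.0, 0.0], "flow": [0.0, 1.0]}
    segments = {
        "rgb": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        "flow": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    }
    weights = {"rgb": 0.5, "flow": 0.5}
    print(rank_segments(ensemble_scores(query, segments, weights)))
```

In this toy example, segment 2 ranks first because it matches the query moderately well in both streams, whereas segments 0 and 1 each match in only one stream. A learned ensemble would typically fit the weights (or a richer combiner) on training data rather than fixing them by hand.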



Author information

Authors and Affiliations

Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Minhang, China
Xinyan Yu, Ya Zhang & Rui Zhang

Corresponding authors

Correspondence to Ya Zhang or Rui Zhang.

Editor information

Editors and Affiliations

Indraprastha Institute of Information Technology Delhi, New Delhi, India
Richa Singh
Indraprastha Institute of Information Technology Delhi, New Delhi, India
Mayank Vatsa
Johns Hopkins University, Baltimore, MD, USA
Vishal M. Patel
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
Nalini Ratha

About this chapter

Cite this chapter

Yu, X., Zhang, Y., Zhang, R. (2020). Cross-Modality Video Segment Retrieval with Ensemble Learning. In: Singh, R., Vatsa, M., Patel, V., Ratha, N. (eds) Domain Adaptation for Visual Understanding. Springer, Cham. https://doi.org/10.1007/978-3-030-30671-7_5

DOI: https://doi.org/10.1007/978-3-030-30671-7_5
Published: 09 January 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30670-0
Online ISBN: 978-3-030-30671-7
eBook Packages: Computer Science, Computer Science (R0)
