Abstract
Video-text cross-modal retrieval is of great significance to computer vision. Most existing works focus on the global similarity between modalities but ignore the influence of fine-grained details on retrieval results. How to explore the correlations between different forms of data from multiple perspectives is therefore a key issue. In this paper, we propose Multi-grained Encoding and Joint Embedding Spaces Fusion (MEJESF) for video-text cross-modal retrieval. Specifically, we propose a novel dual encoding network that captures not only the coarse-grained but also the fine-grained features of each modality. In addition, taking both multi-grained encoding and hard-sample mining into account, we introduce a modified pairwise ranking loss function. Finally, we build two joint embedding spaces and fuse their similarity scores at retrieval time. Experiments on two public benchmark datasets (MSR-VTT, MSVD) demonstrate that our method achieves promising performance compared with state-of-the-art methods in video-text cross-modal retrieval. Furthermore, our model also performs strongly on zero-example video retrieval.
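The abstract mentions two technical ingredients: a pairwise ranking loss with hard-sample mining, and score-level fusion of two joint embedding spaces. As a rough illustration (not the paper's exact formulation), the sketch below shows a hardest-negative pairwise ranking loss in the style of VSE++ and a simple weighted score fusion; the function names, the margin value, and the fusion weight `alpha` are illustrative assumptions.

```python
import numpy as np

def hard_negative_ranking_loss(sim, margin=0.2):
    """Pairwise ranking loss with hardest-negative mining (VSE++-style sketch).

    sim: (n, n) similarity matrix where sim[i, j] scores video i against
    text j; matched video-text pairs lie on the diagonal.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                       # similarities of matched pairs
    neg_mask = ~np.eye(n, dtype=bool)        # exclude the positives
    # hardest negative text for each video (row-wise max over non-matches)
    hard_text = np.where(neg_mask, sim, -np.inf).max(axis=1)
    # hardest negative video for each text (column-wise max over non-matches)
    hard_video = np.where(neg_mask, sim, -np.inf).max(axis=0)
    loss = (np.maximum(0.0, margin - pos + hard_text)
            + np.maximum(0.0, margin - pos + hard_video))
    return loss.sum()

def fuse_scores(sim_a, sim_b, alpha=0.5):
    """Fuse the similarity scores of two joint embedding spaces by a
    weighted sum; ranking at retrieval time then uses the fused matrix."""
    return alpha * sim_a + (1.0 - alpha) * sim_b
```

With a well-separated similarity matrix the loss is zero; it grows only when a negative pair scores within the margin of its positive, which is what drives the hard-sample mining described above.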
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Project No. 62177015, in part by the Key-Area Research and Development Program of Guangdong Province No. 2019B111101001 and in part by the Science and Technology on Information System Engineering Laboratory No. WDZC 20205250410.
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest regarding the publication of this article.
Cite this article
Cui, X., Xiao, J., Cao, Y. et al. Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval. Multimed Tools Appl 81, 34367–34386 (2022). https://doi.org/10.1007/s11042-022-13048-y