
Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval

  • Collection: 1168: Deep Pattern Discovery for Big Multimedia Data
  • Journal: Multimedia Tools and Applications

Abstract

Video-text cross-modal retrieval is of great significance to computer vision. Most existing works focus on exploring the global similarity between modalities but ignore the influence of fine-grained details on retrieval results. How to explore the correlation between different forms of data from multiple perspectives is therefore a key issue. In this paper, we propose Multi-grained Encoding and Joint Embedding Spaces Fusion (MEJESF) for video-text cross-modal retrieval. Specifically, we propose a novel dual encoding network that captures not only coarse-grained but also fine-grained features of each modality. At the same time, taking multi-grained encoding and hard sample mining into consideration, a modified pairwise ranking loss function is introduced. We then build two joint embedding spaces and fuse their similarity scores at retrieval time. Experiments on two public benchmark datasets (MSR-VTT, MSVD) demonstrate that our method achieves promising performance compared to state-of-the-art methods in video-text cross-modal retrieval. Furthermore, our network model also performs strongly in zero-example video retrieval.
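The abstract names two reusable components: a pairwise ranking loss modified with hard sample mining, and score-level fusion across two joint embedding spaces. The PyTorch snippet below is only a minimal sketch of how such components are commonly realised (VSE++-style hardest-negative mining and a weighted score sum); the function names, the margin value, and the fusion weight alpha are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: hard-negative pairwise ranking loss and score fusion.
# Names, margin, and alpha are assumptions for illustration only.
import torch


def pairwise_ranking_loss(video_emb, text_emb, margin=0.2, hard_negatives=True):
    """Max-margin ranking loss over a batch of matched video/text pairs.

    video_emb, text_emb: (B, D) L2-normalized embeddings in one joint space.
    With hard_negatives=True, only the hardest in-batch negative contributes
    (hard sample mining); otherwise all negatives are averaged.
    """
    sim = video_emb @ text_emb.t()               # (B, B) cosine similarities
    pos = sim.diag().view(-1, 1)                 # matched pairs on the diagonal

    cost_t = (margin + sim - pos).clamp(min=0)       # video query -> text negatives
    cost_v = (margin + sim - pos.t()).clamp(min=0)   # text query -> video negatives

    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = cost_t.masked_fill(mask, 0)             # ignore the positive pair itself
    cost_v = cost_v.masked_fill(mask, 0)

    if hard_negatives:
        return cost_t.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()
    return cost_t.mean() + cost_v.mean()


def fused_similarity(sim_coarse, sim_fine, alpha=0.5):
    """Fuse similarity scores from two joint embedding spaces at retrieval time.

    alpha is an assumed weighting hyperparameter; the paper's actual fusion
    scheme may differ.
    """
    return alpha * sim_coarse + (1 - alpha) * sim_fine
```

In the hard-negative setting only the most violating negative in the batch produces a gradient, which is one common way to realise the hard sample mining the abstract refers to; the final ranking is then computed on the fused score rather than on either embedding space alone.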



Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Project No. 62177015, in part by the Key-Area Research and Development Program of Guangdong Province No. 2019B111101001 and in part by the Science and Technology on Information System Engineering Laboratory No. WDZC 20205250410.

Author information

Corresponding author

Correspondence to Jing Xiao.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Cui, X., Xiao, J., Cao, Y. et al. Multi-grained encoding and joint embedding space fusion for video and text cross-modal retrieval. Multimed Tools Appl 81, 34367–34386 (2022). https://doi.org/10.1007/s11042-022-13048-y
