
Joint embeddings with multimodal cues for video-text retrieval

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

In multimedia applications, a joint representation that carries information from multiple modalities is highly beneficial for downstream tasks. In this paper, we study how to effectively utilize the multimodal cues available in videos to learn joint representations for the cross-modal video-text retrieval task. Existing hand-labeled video-text datasets are small relative to the enormous diversity of the visual world, which makes it difficult to train robust deep neural network models for video-text retrieval. To address this, we propose a framework that simultaneously exploits multiple visual cues through a "mixture of experts" approach for retrieval. In addition, we propose a modified pairwise ranking loss for training the embedding and study the effect of various loss functions. Extensive experiments on two benchmark datasets show that our approach yields significant gains over the state of the art.
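To make these two components concrete, the following is a minimal sketch, assuming a PyTorch implementation, of (a) mixture-of-experts fusion of per-modality similarity scores and (b) a bidirectional max-margin pairwise ranking loss that emphasizes the hardest negative in the batch. The function names, the use of cosine similarity, and the hardest-negative weighting are illustrative assumptions and need not match the paper's exact formulation.

```python
# Hypothetical sketch: fusing per-modality similarity "experts" (e.g. object,
# activity, audio cues) and training with a bidirectional max-margin pairwise
# ranking loss. All names and design choices here are illustrative, not the
# authors' exact architecture.
import torch
import torch.nn.functional as F

def cosine_sim(a, b):
    """Pairwise cosine similarity between two batches of embeddings."""
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    return a @ b.t()  # (batch, batch) similarity matrix

def fused_similarity(text_embs, video_embs, expert_weights):
    """Mixture-of-experts fusion: weighted sum of per-modality similarities.

    text_embs:      list of (batch, dim) text embeddings, one per expert space
    video_embs:     list of (batch, dim) video-side embeddings, one per cue
    expert_weights: (num_experts,) normalized mixture weights
    """
    sims = torch.stack([cosine_sim(t, v)
                        for t, v in zip(text_embs, video_embs)])
    return (expert_weights.view(-1, 1, 1) * sims).sum(dim=0)

def pairwise_ranking_loss(sim, margin=0.2):
    """Bidirectional max-margin ranking loss over a similarity matrix.

    Diagonal entries are the matched video-text pairs; for each positive
    pair, only the hardest (highest-scoring) negative in the batch
    contributes to the loss.
    """
    pos = sim.diag().view(-1, 1)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Text-to-video direction: penalize the hardest negative video per sentence.
    cost_t2v = (margin + sim - pos).masked_fill(mask, 0).clamp(min=0)
    # Video-to-text direction: penalize the hardest negative sentence per video.
    cost_v2t = (margin + sim - pos.t()).masked_fill(mask, 0).clamp(min=0)
    return cost_t2v.max(dim=1)[0].mean() + cost_v2t.max(dim=0)[0].mean()
```

In such a setup, the mixture weights could be fixed, learned globally, or predicted per query sentence; which choice is appropriate depends on the fusion strategy adopted.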



Acknowledgements

This work was partially supported by NSF grants 33384, IIS-1746031, CNS-1544969, ACI-1548562, and ACI-1445606. J. Li was supported by the Bosch Graduate Fellowship to CMU LTI. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Author information

Corresponding author

Correspondence to Niluthpol C. Mithun.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Mithun, N.C., Li, J., Metze, F. et al. Joint embeddings with multimodal cues for video-text retrieval. Int J Multimed Info Retr 8, 3–18 (2019). https://doi.org/10.1007/s13735-018-00166-3
