Variational Deep Representation Learning for Cross-Modal Retrieval

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNIP, volume 13020)

Abstract

In this paper, we propose a variational deep representation learning (VDRL) approach for cross-modal retrieval. Many existing methods map images and texts to point representations, which makes it difficult to model the semantic multiplicity of a sample. To address this issue, VDRL maps images and texts to semantic distributions and measures similarity by comparing the difference between those distributions. Specifically, the VDRL network is trained under three constraints: 1) a Variational Autoencoder loss is minimized to learn the distributions of images in the image semantic space and of texts in the text semantic space; 2) a mutual-information term is introduced to ensure that VDRL learns an intact distribution for each sample; and 3) a triplet hinge loss is incorporated to align the distributions of images and texts at the semantic level. Consequently, the semantic multiplicity of each sample is modeled. Experimental results demonstrate that our approach achieves performance competitive with state-of-the-art methods.
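
The abstract names three training constraints but gives no implementation detail. Below is a minimal PyTorch sketch of how the distribution-level pieces could fit together. The diagonal-Gaussian heads, the feature dimensions, and the use of a squared 2-Wasserstein distance between Gaussians are all illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch only: diagonal-Gaussian encoders, a 2-Wasserstein
# distance between distributions, and simple linear heads are assumptions.
# The paper's actual architecture, distribution distance, and
# mutual-information estimator are not specified in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Maps a modality feature to the mean and log-variance of a diagonal
    Gaussian in a shared semantic space (hypothetical design)."""
    def __init__(self, in_dim: int, z_dim: int):
        super().__init__()
        self.mu = nn.Linear(in_dim, z_dim)
        self.logvar = nn.Linear(in_dim, z_dim)

    def forward(self, x):
        return self.mu(x), self.logvar(x)

def reparameterize(mu, logvar):
    # Standard VAE reparameterization, z = mu + sigma * eps with eps ~ N(0, I);
    # a VAE reconstruction branch would decode samples drawn this way.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl_to_standard_normal(mu, logvar):
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian, summed over dimensions:
    # the regularization part of the VAE loss.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

def gaussian_w2(mu1, logvar1, mu2, logvar2):
    # Squared 2-Wasserstein distance between diagonal Gaussians: one concrete
    # way to "compare the difference between distributions".
    s1, s2 = torch.exp(0.5 * logvar1), torch.exp(0.5 * logvar2)
    return torch.sum((mu1 - mu2) ** 2 + (s1 - s2) ** 2, dim=1)

def triplet_hinge(d_pos, d_neg, margin=0.2):
    # Pull matched image-text distributions closer than mismatched ones
    # by at least `margin`.
    return F.relu(margin + d_pos - d_neg).mean()

# Assembling the alignment part of the objective for a toy batch of triplets
# (image, matching text, non-matching text); the input dimensions 2048/768
# are placeholders for pre-extracted image/text features.
img_head, txt_head = GaussianHead(2048, 256), GaussianHead(768, 256)
img = torch.randn(8, 2048)
txt_pos, txt_neg = torch.randn(8, 768), torch.randn(8, 768)
mu_i, lv_i = img_head(img)
mu_p, lv_p = txt_head(txt_pos)
mu_n, lv_n = txt_head(txt_neg)
loss = (kl_to_standard_normal(mu_i, lv_i).mean()
        + kl_to_standard_normal(mu_p, lv_p).mean()
        + triplet_hinge(gaussian_w2(mu_i, lv_i, mu_p, lv_p),
                        gaussian_w2(mu_i, lv_i, mu_n, lv_n)))
loss.backward()
```

A faithful implementation would additionally include the VAE reconstruction term (decoding the sampled z back to the input features) and a mutual-information term between inputs and representations; the abstract names both but does not describe them in enough detail to sketch here.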

Acknowledgment

This work is partially supported by the National Natural Science Foundation of China under Grant 61862050, the National Natural Science Foundation of Ningxia under Grant 2020AAC03031, and the Scientific Research Innovation Project of First-Class Western Universities under Grant ZKZD2017005.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 752 KB)

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, C., Deng, Z., Li, T., Liu, H., Liu, L. (2021). Variational Deep Representation Learning for Cross-Modal Retrieval. In: Ma, H., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science, vol. 13020. Springer, Cham. https://doi.org/10.1007/978-3-030-88007-1_41

  • DOI: https://doi.org/10.1007/978-3-030-88007-1_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-88006-4

  • Online ISBN: 978-3-030-88007-1

  • eBook Packages: Computer Science, Computer Science (R0)
