Abstract
In this paper, we propose a variational deep representation learning (VDRL) approach for cross-modal retrieval. Many existing methods map images and texts to point representations, which makes it difficult to model the semantic multiplicity of a sample. To address this issue, VDRL maps images and texts to semantic distributions and measures similarity by comparing the difference between those distributions. Specifically, the VDRL network is trained under three constraints: 1) a Variational Autoencoder loss is minimized to learn the distributions of images in the image semantic space and of texts in the text semantic space; 2) mutual information is introduced to ensure that VDRL learns the intact distribution of each sample; 3) a triplet hinge loss is incorporated to align the distributions of images and texts at the semantic level. Consequently, the semantic multiplicity of each sample is modeled in our method. Experimental results demonstrate that our approach achieves performance competitive with state-of-the-art methods.
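The three constraints above operate on distribution-level representations rather than point embeddings. As an illustrative sketch only (not the paper's implementation), the code below shows how the third constraint, a triplet hinge loss over distributions, might be computed for diagonal-Gaussian embeddings, using the closed-form 2-Wasserstein distance between Gaussians as the distribution-level dissimilarity; the distance measure, margin, and all numeric values here are assumptions, and VDRL's actual choices may differ.

```python
import numpy as np

def w2_sq_diag_gauss(mu1, sigma1, mu2, sigma2):
    # Squared 2-Wasserstein distance between diagonal Gaussians:
    # W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2  (sigma = per-dim std)
    return float(np.sum((mu1 - mu2) ** 2) + np.sum((sigma1 - sigma2) ** 2))

def triplet_hinge(img, txt_pos, txt_neg, margin=0.2):
    # img / txt_* are (mu, sigma) pairs produced by the image and text
    # encoders. The loss pulls the matching image-text distributions
    # together and pushes the mismatched pair apart by at least `margin`.
    d_pos = w2_sq_diag_gauss(*img, *txt_pos)
    d_neg = w2_sq_diag_gauss(*img, *txt_neg)
    return max(0.0, margin + d_pos - d_neg)

# Toy 2-D embeddings (hypothetical values):
img = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
txt_pos = (np.array([0.1, 0.0]), np.array([1.0, 1.0]))  # close to the image
txt_neg = (np.array([2.0, 2.0]), np.array([1.0, 1.0]))  # far from the image
print(triplet_hinge(img, txt_pos, txt_neg))  # 0.0: the margin is satisfied
```

In practice such a hinge term would be summed over both retrieval directions (image-to-text and text-to-image) and combined with the VAE and mutual-information objectives during training.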
Acknowledgment
This work is partially supported by the National Natural Science Foundation of China under Grant 61862050, the National Natural Science Foundation of Ningxia under Grant 2020AAC03031, and the Scientific Research Innovation Project of First-Class Western Universities under Grant ZKZD2017005.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Yang, C., Deng, Z., Li, T., Liu, H., Liu, L. (2021). Variational Deep Representation Learning for Cross-Modal Retrieval. In: Ma, H., et al. Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science(), vol 13020. Springer, Cham. https://doi.org/10.1007/978-3-030-88007-1_41
Print ISBN: 978-3-030-88006-4
Online ISBN: 978-3-030-88007-1