Abstract
In this paper, we propose a variational deep representation learning (VDRL) approach for cross-modal retrieval. Many existing methods map images and texts to point representations, which makes it difficult to model the semantic multiplicity of a sample. To address this issue, VDRL maps images and texts to semantic distributions and measures similarity by comparing the difference between those distributions. Specifically, the VDRL network is trained under three constraints: 1) a Variational Autoencoder loss is minimized to learn the distributions of images in the image semantic space and of texts in the text semantic space; 2) mutual information is introduced to ensure that VDRL learns the intact distribution of each sample; 3) a triplet hinge loss is incorporated to align the distributions of images and texts at the semantic level. Consequently, the semantic multiplicity of each sample is modeled in our method. Experimental results demonstrate that our approach achieves performance competitive with state-of-the-art methods.
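The three constraints above operate on distribution-level representations rather than point embeddings. As an illustrative sketch only (not the paper's implementation), the code below shows how the third constraint, a triplet hinge loss over distributions, might be computed for diagonal-Gaussian embeddings, using the closed-form 2-Wasserstein distance between Gaussians as the distribution-level dissimilarity; the distance measure, margin, and all numeric values here are assumptions, and VDRL's actual choices may differ.

```python
import numpy as np

def w2_sq_diag_gauss(mu1, sigma1, mu2, sigma2):
    # Squared 2-Wasserstein distance between diagonal Gaussians:
    # W2^2 = ||mu1 - mu2||^2 + ||sigma1 - sigma2||^2  (sigma = per-dim std)
    return float(np.sum((mu1 - mu2) ** 2) + np.sum((sigma1 - sigma2) ** 2))

def triplet_hinge(img, txt_pos, txt_neg, margin=0.2):
    # img / txt_* are (mu, sigma) pairs produced by the image and text
    # encoders. The loss pulls the matching image-text distributions
    # together and pushes the mismatched pair apart by at least `margin`.
    d_pos = w2_sq_diag_gauss(*img, *txt_pos)
    d_neg = w2_sq_diag_gauss(*img, *txt_neg)
    return max(0.0, margin + d_pos - d_neg)

# Toy 2-D embeddings (hypothetical values):
img = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))
txt_pos = (np.array([0.1, 0.0]), np.array([1.0, 1.0]))  # close to the image
txt_neg = (np.array([2.0, 2.0]), np.array([1.0, 1.0]))  # far from the image
print(triplet_hinge(img, txt_pos, txt_neg))  # 0.0: the margin is satisfied
```

In practice such a hinge term would be summed over both retrieval directions (image-to-text and text-to-image) and combined with the VAE and mutual-information objectives during training.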
Acknowledgment
This work is partially supported by the National Natural Science Foundation of China under Grant 61862050, the National Natural Science Foundation of Ningxia under Grant 2020AAC03031, and the Scientific Research Innovation Project of First-Class Western Universities under Grant ZKZD2017005.
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Yang, C., Deng, Z., Li, T., Liu, H., Liu, L. (2021). Variational Deep Representation Learning for Cross-Modal Retrieval. In: Ma, H., et al. Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science(), vol 13020. Springer, Cham. https://doi.org/10.1007/978-3-030-88007-1_41
Print ISBN: 978-3-030-88006-4
Online ISBN: 978-3-030-88007-1