A Multiple Positives Enhanced NCE Loss for Image-Text Retrieval

Li, Yi; Wu, Dehao; Zhu, Yuesheng

doi:10.1007/978-3-030-98358-1_34

Yi Li¹⁵,
Dehao Wu¹⁵ &
Yuesheng Zhu¹⁵

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13141))

Included in the following conference series:

International Conference on Multimedia Modeling

2191 Accesses

Abstract

Image-Text Retrieval (ITR) enables users to retrieve relevant contents from different modalities and has attracted considerable attention. Existing approaches typically utilize contrastive loss functions to conduct contrastive learning in the common embedding space, where they aim at pulling semantically related pairs closer while pushing away unrelated pairs. However, we argue that this behaviour is too strict: these approaches neglect to address the inherent misalignments from potential semantically related samples. For example, it commonly exists more than one positive samples in the current batch for a given query and previous methods enforce them apart even if they are semantically related, which leads to a sub-optimal and contradictory optimization direction and then decreases the retrieval performance. In this paper, a Multiple Positives Enhanced Noise Contrastive Estimation learning objective is proposed to alleviate the diversion noise by leveraging and optimizing multiple positive pairs overall for each sample in a mini-batch. We demonstrate the effectiveness of our approach on MS-COCO and Flickr30K datasets for image-to-text and text-to-image retrieval.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 79.99; Price excludes VAT (USA)

Softcover Book: USD 99.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
Google Scholar
Zhu, Y., et al.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27 (2015)
Google Scholar
Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. JMLR Workshop and Conference Proceedings (2010)
Google Scholar
Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: BMVC, p. 12. BMVA Press (2018)
Google Scholar
Li, K., Zhang, Y., Li, K., Li, Y., Fu, Y.: Visual semantic reasoning for image-text matching. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4654–4662 (2019)
Google Scholar
Messina, N., Amato, G., Esuli, A., Falchi, F., Gennaro, C., Marchand-Maillet, S.: Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 17(4), 1–23 (2021)
Google Scholar
Qu, L., Liu, M., Cao, D., Nie, L., Tian, Q.: Context-aware multi-view summarization network for image-text matching. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1047–1055 (2020)
Google Scholar
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2, 67–78 (2014)
Article Google Scholar
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Chapter Google Scholar
Ji, Z., Wang, H., Han, J., Pang, Y.: SMAN: stacked multimodal attention network for cross-modal image-text retrieval. IEEE Trans. Cybern. (2020)
Google Scholar
Messina, N., Falchi, F., Esuli, A., Amato, G.: Transformer reasoning network for image-text matching and retrieval. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5222–5229. IEEE (2021)
Google Scholar
Eisenschtat, A., Wolf, L.: Linking image and text with 2-way nets. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4601–4611 (2017)
Google Scholar
Klein, B., Lev, G., Sadeh, G., Wolf, L.: Associating neural word embeddings with deep image representations using fisher vectors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4437–4446 (2015)
Google Scholar
Gu, J., Cai, J., Joty, S.R., Niu, L., Wang, G.: Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7181–7189 (2018)
Google Scholar
Huang, Y., Wu, Q., Song, C., Wang, L.: Learning semantic concepts and order for image and sentence matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6163–6171 (2018)
Google Scholar
Liu, Y., Guo, Y., Bakker, E.M., Lew, M.S.: Learning a recurrent residual fusion network for multimodal matching. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4107–4116 (2017)
Google Scholar
Lee, K.H., Chen, X., Hua, G., Hu, H., He, X.: Stacked cross attention for image-text matching. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216 (2018)
Google Scholar
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. Adv. Neural Inform. Process. Syst. 28, 91–99 (2015)
Google Scholar
Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Google Scholar
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)
Google Scholar
Lu, J., Batra, D., Parikh, D., Lee, S.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Advances in Neural Information Processing Systems, pp. 13–23 (2019)
Google Scholar
Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 7464–7473 (2019)
Google Scholar
Karpathy, A., Joulin, A., Fei-Fei, L.F.: Deep fragment embeddings for bidirectional image sentence mapping. Adv. Neural Inform. Process. Syst. 27 (2014)
Google Scholar
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
Google Scholar
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Google Scholar
Gao, T., Yao, X., Chen, D.: Simcse: Simple contrastive learning of sentence embeddings. In: EMNLP (1). pp. 6894–6910. Association for Computational Linguistics (2021)
Google Scholar
Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
Sarafianos, N., Xu, X., Kakadiaris, I.A.: Adversarial representation learning for text-to-image matching. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5814–5824 (2019)
Google Scholar
Ji, Z., Lin, Z., Wang, H., He, Y.: Multi-modal memory enhancement attention network for image-text matching. IEEE Access 8, 38438–38447 (2020)
Article Google Scholar
Wei, K., Zhou, Z.: Adversarial attentive multi-modal embedding learning for image-text matching. IEEE Access 8, 96237–96248 (2020)
Article Google Scholar
Li, X., et al.: Oscar: object-semantics aligned pre-training for vision-language tasks. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12375, pp. 121–137. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58577-8_8
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Shenzhen Graduate School, Peking University, Beijing, China
Yi Li, Dehao Wu & Yuesheng Zhu

Authors

Yi Li
View author publications
You can also search for this author in PubMed Google Scholar
Dehao Wu
View author publications
You can also search for this author in PubMed Google Scholar
Yuesheng Zhu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yuesheng Zhu .

Editor information

Editors and Affiliations

IT University of Copenhagen, Copenhagen, Denmark
Björn Þór Jónsson
Dublin City University, Dublin, Ireland
Cathal Gurrin
University of Science, VNU-HCM, Ho Chi Minh City, Vietnam
Minh-Triet Tran
University of Bergen, Bergen, Norway
Duc-Tien Dang-Nguyen
National Tsing Hua University, Hsinchu, Taiwan
Anita Min-Chun Hu
Hanoi University of Science and Technology, Hanoi, Vietnam
Binh Huynh Thi Thanh
Median Technologies, Valbonne, France
Benoit Huet

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Li, Y., Wu, D., Zhu, Y. (2022). A Multiple Positives Enhanced NCE Loss for Image-Text Retrieval. In: Þór Jónsson, B., et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13141. Springer, Cham. https://doi.org/10.1007/978-3-030-98358-1_34

Download citation

DOI: https://doi.org/10.1007/978-3-030-98358-1_34
Published: 15 March 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98357-4
Online ISBN: 978-3-030-98358-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics