Abstract
Cross-modal retrieval requires retrieving text from images and vice versa. Most existing methods leverage attention mechanisms to build advanced encoding networks and use ranking losses to reduce the modality gap. Although these methods achieve remarkable performance, they still suffer from drawbacks that hinder the model from learning discriminative semantic embeddings. For example, the attention mechanism may assign larger weights to irrelevant parts than to relevant ones, which prevents the model from learning a discriminative attention distribution. In addition, traditional ranking losses can discard relatively discriminative information because they lack appropriate hardest-negative mining and information-weighting schemes. To alleviate these issues, this paper proposes a novel semantic-enhanced discriminative embedding learning method that enhances the discriminative ability of the model through three modules. The attention-guided erasing module erases non-attended parts, making the attention model focus on relevant parts and reducing interference from irrelevant ones. The large-scale negative sampling module leverages momentum-updated memory banks to expand the pool of negative samples, which increases the probability that the hardest negatives are sampled. The weighted InfoNCE loss module employs a weighting scheme that assigns a larger weight to a harder pair. We evaluate the proposed modules by integrating them into three existing cross-modal retrieval models. Extensive experiments demonstrate that integrating each proposed module into the existing models steadily improves the performance of all models.
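To make the weighted InfoNCE idea concrete, the following is a minimal sketch for a single query embedding. The exponential hardness weighting (controlled by the hypothetical hyperparameter `beta`) is an illustrative assumption, not the paper's exact scheme: negatives more similar to the query receive larger weights in the denominator, so harder pairs contribute more to the loss.

```python
import numpy as np

def weighted_infonce(query, positive, negatives, tau=0.07, beta=2.0):
    """Weighted InfoNCE loss for one query (all embeddings L2-normalised).

    Assumption: hardness weights w_i = exp(beta * s_i), rescaled to mean 1,
    scale each negative's contribution; the paper's weighting may differ.
    """
    s_pos = float(query @ positive)          # similarity to the positive
    s_neg = negatives @ query                # similarities to N negatives

    # Hardness-aware weights: larger for negatives closer to the query,
    # normalised to average 1 so the overall loss scale stays stable.
    w = np.exp(beta * s_neg)
    w = w * len(w) / w.sum()

    num = np.exp(s_pos / tau)
    den = num + np.sum(w * np.exp(s_neg / tau))
    return -np.log(num / den)                # standard InfoNCE form
```

In a full model, `negatives` would be drawn from the momentum-updated memory bank rather than the current mini-batch, which is what enlarges the pool of candidate hardest negatives.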
Acknowledgements
This paper was supported by the National Key R&D Program of China (2019YFC1521204).
Cite this article
Pan, H., Huang, J. Semantic-enhanced discriminative embedding learning for cross-modal retrieval. Int J Multimed Info Retr 11, 369–382 (2022). https://doi.org/10.1007/s13735-022-00237-6