Abstract
Cross-modal retrieval requires building a common latent space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique for this problem. This paper shows that such an approach is not always effective at handling intra-modal similarities. Specifically, we found that this method can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To the best of our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels.
We present comprehensive experiments on MS-COCO and Flickr30K, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. Our code is publicly available at https://github.com/MariodotR/FullHN.git.
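For readers unfamiliar with the baseline this work builds on, below is a minimal PyTorch sketch of the standard cross-modal triplet loss with in-batch hardest-negative mining, in the style of VSE++ (Faghri et al., cited below). It illustrates only the original formulation criticized above, not the intra-modal extensions proposed in the paper; the function name and margin value are illustrative assumptions.

```python
import torch


def triplet_loss_hard_negatives(im: torch.Tensor, txt: torch.Tensor,
                                margin: float = 0.2) -> torch.Tensor:
    """Max-hinge triplet loss with in-batch hardest negatives (VSE++ style).

    `im` and `txt` are L2-normalized embeddings of shape (B, D);
    row i of `im` matches row i of `txt` (a positive image-caption pair).
    """
    scores = im @ txt.t()                    # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)          # similarities of matching pairs

    # Hinge cost of every in-batch negative, for both retrieval directions.
    cost_cap = (margin + scores - pos).clamp(min=0)      # image -> caption
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption -> image

    # Exclude the positive pairs on the diagonal from negative mining.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0.0)
    cost_img = cost_img.masked_fill(mask, 0.0)

    # Keep only the hardest (highest-cost) negative per query.
    return cost_cap.max(dim=1).values.sum() + cost_img.max(dim=0).values.sum()
```

Note that this loss only compares cross-modal pairs; it places no constraint on the similarity between two images or two captions, which is the gap the paper's proposed losses target.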
This research was partially funded by the National Agency for Research and Development (ANID, Chile), grant numbers FONDEF IT21I0019, ANID PIA/APOYO AFB180002, and ANID-Basal Project FB0008.
Notes
1. For MS-COCO, we report results for 5k images.
2. With results of \([351.3-350.3, 351.0-349.3, 350.4-351.4]\), respectively.
3. With results of \([367.0, 368.1, 375.9, 377.5, 370.4]\), respectively.
4. The best value is underlined, and the best value without considering TERAN is highlighted in bold.
References
Chaudhuri, U., Banerjee, B., Bhattacharya, A., Datcu, M.: CMIR-NET: a deep learning based model for cross-modal retrieval in remote sensing. Pattern Recogn. Lett. 131, 456–462 (2020)
Clarke, C.L., et al.: Novelty and diversity in information retrieval evaluation. In: SIGIR 2008, pp. 659–666. ACM, New York (2008)
Do, T.T., Tran, T., Reid, I., et al.: A theoretically sound upper bound on the triplet loss for improving the efficiency of deep distance metric learning. In: IEEE CVPR, pp. 10404–10413 (2019)
Dubey, S.R.: A decade survey of content based image retrieval using deep learning. IEEE Trans. Circ. Syst. Video Technol. 32, 2687–2704 (2020)
Faghri, F., Fleet, D.J., Kiros, J.R., Fidler, S.: VSE++: improving visual-semantic embeddings with hard negatives. In: Proceedings of BMVC (2017)
Ge, W., Huang, W., Dong, D., Scott, M.R.: Deep metric learning with hierarchical triplet loss. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 272–288. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_17
Gong, Y., Cosma, G.: Improving visual-semantic embeddings by learning semantically-enhanced hard negatives for cross-modal information retrieval. Pattern Recogn. 137, 109272 (2023)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 664–676 (2017)
Kaya, M., Bilge, H.Ş.: Deep metric learning: a survey. Symmetry 11(9), 1066 (2019)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2017)
Li, X., Yang, J., Ma, J.: Recent developments of content-based image retrieval (CBIR). Neurocomputing 452, 675–689 (2021)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Ma, H., et al.: EI-CLIP: entity-aware interventional contrastive learning for e-commerce cross-modal retrieval. In: CVPR, pp. 18051–18061 (2022)
Messina, N., et al.: Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 17(4), 1–23 (2021)
Messina, N., Falchi, F., Esuli, A., Amato, G.: Transformer reasoning network for image-text matching and retrieval. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 5222–5229. IEEE (2021)
Molina, G., et al.: A new content-based image retrieval system for SARS-CoV-2 computer-aided diagnosis. In: Su, R., Zhang, Y.-D., Liu, H. (eds.) MICAD 2021. LNEE, vol. 784, pp. 316–324. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-3880-0_33
Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24
Ren, R., et al.: PAIR: leveraging passage-centric similarity relation for improving dense passage retrieval. In: Findings of ACL-IJCNLP 2021, pp. 2173–2183 (2021)
Schubert, E.: A triangle inequality for cosine similarity. In: Reyes, N., et al. (eds.) SISAP 2021. LNCS, vol. 13058, pp. 32–44. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-89657-7_3
Song, H.O., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: IEEE CVPR, pp. 4004–4012 (2016)
Song, K., Tan, X., Qin, T., Lu, J., Liu, T.Y.: MPNet: masked and permuted pre-training for language understanding. NeurIPS 33, 16857–16867 (2020)
Song, Y., Soleymani, M.: Polysemous visual-semantic embedding for cross-modal retrieval. In: CVPR, pp. 1979–1988 (2019)
Tan, M., Le, Q.V.: EfficientNetV2: smaller models and faster training. CoRR abs/2104.00298 (2021)
Tian, Y., et al.: SOSNet: second order similarity regularization for local descriptor learning. In: IEEE CVPR, pp. 11008–11017 (2019)
Ng, T., Balntas, V., Tian, Y., Mikolajczyk, K.: SOLAR: second-order loss and attention for image retrieval. arXiv preprint (2020)
Wang, Z., et al.: Adaptive margin based deep adversarial metric learning. In: IEEE BigDataSecurity/HPSC/IDS 2020, pp. 100–108 (2020)
Chen, W., Chen, X., Zhang, J., Huang, K.: Beyond triplet loss: a deep quadruplet network for person re-identification. In: IEEE CVPR, pp. 1320–1329 (2017)
Wu, Y., Wang, S., Huang, Q.: Online asymmetric similarity learning for cross-modal retrieval. In: IEEE CVPR, pp. 3984–3993 (2017)
Wu, Y., Wang, S., Huang, Q.: Online fast adaptive low-rank similarity learning for cross-modal retrieval. IEEE Trans. Multimedia 22(5), 1310–1322 (2020)
Xuan, H., Stylianou, A., Liu, X., Pless, R.: Hard negative examples are hard, but useful. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 126–142. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_8
Yang, J., et al.: Vision-language pre-training with triple contrastive learning. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15650–15659 (2022)
Ye, M., et al.: Deep learning for person re-identification: a survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 44(6), 2872–2893 (2021)
Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2, 67–78 (2014)
Zhao, C., et al.: Deep fusion feature representation learning with hard mining center-triplet loss for person re-identification. IEEE Trans. Multimedia 22(12), 3180–3195 (2020)
Zhou, T., et al.: Solving the apparent diversity-accuracy dilemma of recommender systems. PNAS 107, 4511–4515 (2010)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mallea, M., Nanculef, R., Araya, M. (2023). Enhancing Intra-modal Similarity in a Cross-Modal Triplet Loss. In: Bifet, A., Lorena, A.C., Ribeiro, R.P., Gama, J., Abreu, P.H. (eds.) Discovery Science. DS 2023. Lecture Notes in Computer Science, vol. 14276. Springer, Cham. https://doi.org/10.1007/978-3-031-45275-8_17
DOI: https://doi.org/10.1007/978-3-031-45275-8_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45274-1
Online ISBN: 978-3-031-45275-8