Abstract
The text-image retrieval task has attracted extensive attention in recent years. Because images and texts follow different feature distributions, performance on this task suffers from a large modality discrepancy. Most retrieval methods map images and texts into a common embedding space and measure their similarity there. However, a dataset may contain multiple texts describing the same image, and previous approaches rarely consider these texts jointly when computing similarities in the common space. In this paper, we propose an improved text-image cross-modal retrieval framework with a contrastive loss that takes the multiple texts of each image into account. Using the aggregated text features, our approach better aligns each image with the center of its corresponding texts. Results on the Flickr30K dataset achieve competitive performance, validating the effectiveness of the proposed framework.
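
To make the alignment idea concrete, below is a minimal PyTorch sketch of a contrastive loss between image embeddings and per-image text centers. The mean-pooled "text center", the symmetric InfoNCE form, and the temperature value are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def image_to_text_center_loss(image_emb, text_emb, temperature=0.07):
    """Contrastive (InfoNCE-style) loss aligning each image with the
    center of its multiple captions.

    image_emb: (B, D)    one embedding per image
    text_emb:  (B, K, D) K caption embeddings per image (e.g. K = 5 on Flickr30K)

    NOTE: mean pooling as the "text center" and the symmetric InfoNCE
    form are assumptions of this sketch, not the paper's exact loss.
    """
    # Aggregate the K captions of each image into a single text center.
    text_center = text_emb.mean(dim=1)                      # (B, D)

    # Cosine similarity in the common embedding space.
    image_emb = F.normalize(image_emb, dim=-1)
    text_center = F.normalize(text_center, dim=-1)
    logits = image_emb @ text_center.t() / temperature      # (B, B)

    # Matched image/text-center pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)             # image -> text center
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text center -> image
    return (loss_i2t + loss_t2i) / 2
```

In this sketch, all other images in the mini-batch serve as negatives for a given image, while its own caption center serves as the positive, which pulls the image embedding toward the center of all its descriptions rather than toward any single one.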



Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (61902277, U21B2024), the China Postdoctoral Science Foundation (2021T140511, 2020M680884), the Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. A02106), and the Open Funding Project of the State Key Laboratory of Communication Content Cognition (Grant No. 20K04).
About this article
Cite this article
Zhang, C., Yang, Y., Guo, J. et al. Improving text-image cross-modal retrieval with contrastive loss. Multimedia Systems 29, 569–575 (2023). https://doi.org/10.1007/s00530-022-00962-2