
Asymmetric bi-encoder for image–text retrieval

  • Special Issue Paper
  • Published in: Multimedia Systems

Abstract

Image–text retrieval aims to capture the similarity relationship among image–text pairs using a ranking model with an optimal distance metric. Although mining informative pairs is of central importance to training a ranking model, the currently dominant ranking model, the Cross-Encoder (CE), processes each image–text pair jointly with cross-attention mechanisms, imposing \({\mathcal {O}}(N^2)\) encoding complexity. Consequently, with limited computational resources, we cannot train a CE with a large batch size, so only a mini-batch of pairs is accessible at each iteration. In contrast, the efficient but less effective Bi-Encoder (BE) encodes texts and images separately, achieving \({\mathcal {O}}(N)\) encoding complexity. To fulfill the potential of the CE, we therefore propose an Asymmetric Bi-Encoder (ABE), a combination of CE and BE. For image-to-text retrieval, we encode images with the BE and texts with the CE; conversely, for text-to-image retrieval, we encode texts with the BE and images with the CE. Furthermore, during training, we sample large-scale negative pairs with the BE, overcoming the batch-size limitation and mining more informative examples at \({\mathcal {O}}(N)\) complexity. Our method is conceptually simple and easy to implement, and systematic experiments on public benchmarks validate its effectiveness in boosting image–text retrieval.
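The two ideas in the abstract, asymmetric encoding and BE-driven hard-negative mining, can be made concrete with a short sketch. The code below is a minimal PyTorch-style illustration under stated assumptions, not the authors' implementation: the module names (`BiEncoderSide`, `CrossEncoderSide`), the single attention block standing in for a full cross-encoder, the embedding size `D`, and the dot-product scoring head are all hypothetical choices made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # shared embedding size (an assumption of this sketch)


class BiEncoderSide(nn.Module):
    """BE tower: encodes one modality independently of the other,
    so a pool of N items costs O(N) encoder passes."""

    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, D)

    def forward(self, x):  # x: (N, in_dim)
        return F.normalize(self.proj(x), dim=-1)


class CrossEncoderSide(nn.Module):
    """CE side, reduced to one attention block: the other modality's BE
    embedding is prepended as a context token and the joint sequence is
    attended, in the spirit of single-stream cross-encoders."""

    def __init__(self, in_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, D)
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    def forward(self, tokens, context):  # tokens: (N, T, in_dim), context: (N, D)
        seq = torch.cat([context.unsqueeze(1), self.proj(tokens)], dim=1)
        out, _ = self.attn(seq, seq, seq)      # joint cross-modal attention
        return F.normalize(out[:, 0], dim=-1)  # pool at the context token


def i2t_scores(img_feats, txt_tokens, img_be, txt_ce):
    """Image-to-text direction: images via BE, texts via CE."""
    img_emb = img_be(img_feats)  # (N_img, D), a single O(N) pass
    scores = []
    for i in range(img_emb.size(0)):  # one CE pass per (query image, text)
        ctx = img_emb[i].expand(txt_tokens.size(0), -1)  # broadcast the query
        txt_emb = txt_ce(txt_tokens, ctx)                # (N_txt, D)
        scores.append(txt_emb @ img_emb[i])              # dot-product scores
    return torch.stack(scores)  # (N_img, N_txt)


@torch.no_grad()
def mine_negatives_with_be(img_emb, txt_emb, k=4):
    """Training-time mining: cheap BE similarities over a large pool pick
    the k hardest non-matching texts per image at O(N) encoding cost."""
    sims = img_emb @ txt_emb.T          # (N, N) for N aligned pairs
    sims.fill_diagonal_(float("-inf"))  # mask out the positive pairs
    return sims.topk(k, dim=1).indices  # hard-negative text indices


if __name__ == "__main__":
    # Toy shapes: 8 paired examples, 5 tokens per text; the raw feature
    # dimensions (2048 for images, 300 for words) are placeholders.
    imgs, txts = torch.randn(8, 2048), torch.randn(8, 5, 300)
    img_be, txt_ce = BiEncoderSide(2048), CrossEncoderSide(300)
    print(i2t_scores(imgs, txts, img_be, txt_ce).shape)  # torch.Size([8, 8])
    negs = mine_negatives_with_be(img_be(imgs), F.normalize(torch.randn(8, D), dim=-1))
    print(negs.shape)  # torch.Size([8, 4])
```

In training, the mined indices would shortlist hard negatives for the expensive CE objective, so the cross-encoder only ever scores the \({\mathcal {O}}(N)\) candidates surfaced by the bi-encoder rather than all \(N^2\) pairs.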


Data availability

All code and data are available.


Acknowledgements

This work is supported by the National Key R&D Program of China (2018AAA0100104, 2018AAA0100100) and the Natural Science Foundation of Jiangsu Province (BK20211164).

Author information


Corresponding author

Correspondence to Yu Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Xiong, W., Liu, H., Mi, S. et al. Asymmetric bi-encoder for image–text retrieval. Multimedia Systems 29, 3805–3818 (2023). https://doi.org/10.1007/s00530-023-01162-2

