
Spherical Linear Interpolation and Text-Anchoring for Zero-Shot Composed Image Retrieval

  • Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15077)

Abstract

Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of a reference image and a caption that describes desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive manually annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word token-based Zero-Shot CIR (ZS-CIR) methods, which use a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text side. To resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of both. Furthermore, we introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, TAT is efficient in both training-data scale and training time, and it also serves as an excellent initial checkpoint for training supervised CIR models, highlighting its wider potential. Integrating Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks. Code is available at https://github.com/youngkyunJang/SLERP-TAT.
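For intuition, the Slerp operation at the core of the method can be sketched in a few lines. The NumPy version below is a minimal illustration, not the released implementation (linked above); the function name slerp and the balance parameter alpha, which weights the text embedding against the image embedding, are assumptions of this sketch.

    import numpy as np

    def slerp(v0: np.ndarray, v1: np.ndarray, alpha: float) -> np.ndarray:
        """Spherical linear interpolation between two embeddings.

        Returns a point on the great-circle arc between the unit-normalized
        v0 and v1; alpha = 0 gives v0, alpha = 1 gives v1.
        """
        v0 = v0 / np.linalg.norm(v0)
        v1 = v1 / np.linalg.norm(v1)
        dot = np.clip(np.dot(v0, v1), -1.0, 1.0)
        theta = np.arccos(dot)  # angle between the two embeddings
        if np.isclose(theta, 0.0):
            # Nearly parallel embeddings: Slerp degenerates to linear interpolation.
            return (1.0 - alpha) * v0 + alpha * v1
        s = np.sin(theta)
        return (np.sin((1.0 - alpha) * theta) / s) * v0 + (np.sin(alpha * theta) / s) * v1

    # Hypothetical usage: compose a retrieval query from CLIP embeddings of the
    # reference image and the modification caption.
    # query = slerp(image_embedding, text_embedding, alpha=0.5)

Text-Anchored-Tuning can likewise be pictured as locked-text contrastive tuning, the mirror image of LiT's locked-image setup: caption embeddings from the frozen text encoder act as fixed anchors toward which image embeddings are pulled, shrinking the modality gap. The PyTorch sketch below assumes an InfoNCE-style objective with in-batch negatives and a temperature tau; the paper's exact objective and hyperparameters are not given in this excerpt.

    import torch
    import torch.nn.functional as F

    def tat_step(image_encoder, text_encoder, images, captions, optimizer, tau=0.07):
        """One Text-Anchored-Tuning step: only the image encoder is updated."""
        with torch.no_grad():  # text encoder stays frozen; captions act as anchors
            t = F.normalize(text_encoder(captions), dim=-1)
        v = F.normalize(image_encoder(images), dim=-1)
        logits = v @ t.t() / tau                        # image-to-text similarities
        labels = torch.arange(v.size(0), device=v.device)
        loss = F.cross_entropy(logits, labels)          # InfoNCE-style contrastive loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()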


Notes

  1. https://huggingface.co/models.
  2. https://github.com/salesforce/BLIP.


Author information

Correspondence to Young Kyun Jang.

Electronic supplementary material

Supplementary material 1 (pdf 3960 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Jang, Y.K., Huynh, D., Shah, A., Chen, W.K., Lim, S.N. (2025). Spherical Linear Interpolation and Text-Anchoring for Zero-Shot Composed Image Retrieval. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15077. Springer, Cham. https://doi.org/10.1007/978-3-031-72655-2_14

  • DOI: https://doi.org/10.1007/978-3-031-72655-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72654-5

  • Online ISBN: 978-3-031-72655-2

  • eBook Packages: Computer Science, Computer Science (R0)
