Abstract
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of a reference image and a caption describing desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive, manually annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word-token-based Zero-Shot CIR (ZS-CIR) methods, which use a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text side. To resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of the two. Furthermore, we introduce Text-Anchored Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, TAT is efficient in terms of both the scale of the training dataset and the training time, and it also serves as an excellent initial checkpoint for training supervised CIR models, highlighting its wider potential. Integrating Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks. Code is available at https://github.com/youngkyunJang/SLERP-TAT.
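The Slerp-based composition can be made concrete with a short sketch. The snippet below is a minimal, hedged illustration of spherical linear interpolation between CLIP-style image and text embeddings; the PyTorch framing, the function name, and the balance parameter `t` are assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def slerp(img_emb: torch.Tensor, txt_emb: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Spherical linear interpolation between L2-normalized embeddings.

    t = 0 returns the image embedding and t = 1 the text embedding;
    intermediate values trace the great-circle arc between the two on
    the unit hypersphere, yielding a composed query embedding.
    """
    u = F.normalize(img_emb, dim=-1)
    v = F.normalize(txt_emb, dim=-1)
    # Angle between the unit vectors; clamp the cosine for numerical safety.
    cos = (u * v).sum(dim=-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cos)
    return (torch.sin((1 - t) * theta) * u + torch.sin(t * theta) * v) / torch.sin(theta)
```

The composed query would then be matched against gallery image embeddings by cosine similarity, as in standard CLIP retrieval. Likewise, Text-Anchored Tuning can be pictured as contrastive training with the text tower locked; the sketch below assumes a standard symmetric InfoNCE objective, and the step function, encoder interfaces, and temperature value are hypothetical placeholders rather than the paper's exact recipe.

```python
def tat_step(image_encoder, text_encoder, images, captions, optimizer,
             temperature: float = 0.07) -> float:
    """One Text-Anchored Tuning step: the frozen text encoder supplies fixed
    anchors, and only the image encoder is updated, pulling image embeddings
    toward the text side of the joint space and shrinking the modality gap."""
    with torch.no_grad():                          # text tower stays fixed
        txt = F.normalize(text_encoder(captions), dim=-1)
    img = F.normalize(image_encoder(images), dim=-1)
    logits = img @ txt.t() / temperature           # pairwise cosine similarities
    labels = torch.arange(img.size(0), device=logits.device)
    # Symmetric contrastive loss over image-to-text and text-to-image directions.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the text embeddings never move, they anchor the shared space while the image tower is pulled toward them, which is what makes the subsequent Slerp between the two modalities well behaved.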