Abstract
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query composed of a reference image and a caption describing desired modifications to that image. Supervised CIR approaches have shown strong performance, but their reliance on expensive, manually annotated datasets restricts their scalability and broader applicability. To address these issues, previous studies have proposed pseudo-word-token-based Zero-Shot CIR (ZS-CIR) methods, which use a projection module to map images to word tokens. However, we conjecture that this approach has a downside: the projection module distorts the original image representation and confines the resulting composed embeddings to the text side. To resolve this, we introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations by identifying an intermediate embedding of the two. Furthermore, we introduce Text-Anchored Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed. TAT closes the modality gap between images and text, making the Slerp process much more effective. Notably, TAT is efficient in terms of both the scale of the training dataset and the training time, and it also serves as an excellent initial checkpoint for training supervised CIR models, highlighting its wider potential. Integrating Slerp-based ZS-CIR with a TAT-tuned model enables our approach to deliver state-of-the-art retrieval performance across CIR benchmarks. Code is available at https://github.com/youngkyunJang/SLERP-TAT.
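The Slerp-based composition can be made concrete with a short sketch. The snippet below is a minimal, hedged illustration of spherical linear interpolation between CLIP-style image and text embeddings; the PyTorch framing, the function name, and the balance parameter `t` are assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def slerp(img_emb: torch.Tensor, txt_emb: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Spherical linear interpolation between L2-normalized embeddings.

    t = 0 returns the image embedding and t = 1 the text embedding;
    intermediate values trace the great-circle arc between the two on
    the unit hypersphere, yielding a composed query embedding.
    """
    u = F.normalize(img_emb, dim=-1)
    v = F.normalize(txt_emb, dim=-1)
    # Angle between the unit vectors; clamp the cosine for numerical safety.
    cos = (u * v).sum(dim=-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cos)
    return (torch.sin((1 - t) * theta) * u + torch.sin(t * theta) * v) / torch.sin(theta)
```

The composed query would then be matched against gallery image embeddings by cosine similarity, as in standard CLIP retrieval. Likewise, Text-Anchored Tuning can be pictured as contrastive training with the text tower locked; the sketch below assumes a standard symmetric InfoNCE objective, and the step function, encoder interfaces, and temperature value are hypothetical placeholders rather than the paper's exact recipe.

```python
def tat_step(image_encoder, text_encoder, images, captions, optimizer,
             temperature: float = 0.07) -> float:
    """One Text-Anchored Tuning step: the frozen text encoder supplies fixed
    anchors, and only the image encoder is updated, pulling image embeddings
    toward the text side of the joint space and shrinking the modality gap."""
    with torch.no_grad():                          # text tower stays fixed
        txt = F.normalize(text_encoder(captions), dim=-1)
    img = F.normalize(image_encoder(images), dim=-1)
    logits = img @ txt.t() / temperature           # pairwise cosine similarities
    labels = torch.arange(img.size(0), device=logits.device)
    # Symmetric contrastive loss over image-to-text and text-to-image directions.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the text embeddings never move, they anchor the shared space while the image tower is pulled toward them, which is what makes the subsequent Slerp between the two modalities well behaved.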