DOI: 10.1145/3581783.3612664
demonstration

Zero-Shot Image Retrieval with Human Feedback

Published: 27 October 2023

Abstract

Composed image retrieval extends traditional content-based image retrieval (CBIR) by combining a query image with additional descriptive text that expresses user intent and specifies supplementary requests related to the visual attributes of the query image. This approach holds significant potential for e-commerce applications, such as interactive multimodal searches and chatbots. In our demo, we present an interactive composed image retrieval system based on the SEARLE approach, which tackles this task efficiently and effectively in a zero-shot manner. The demo allows users to perform image retrieval while iteratively refining the results with textual feedback.

Supplemental Material

MP4 File
In this video, we show the interface of our interactive Composed Image Retrieval (CIR) system, which allows users to perform image retrieval on an open-domain dataset by iteratively refining the results using textual feedback. Our demo relies on the SEARLE approach, the current state of the art for zero-shot CIR. Thanks to SEARLE, users can upload their own images without restrictions on domain or source. First, the user either uploads a reference image or chooses one of the randomly selected examples we provide. Then, they request a modification, either through arbitrary textual input or by choosing the suggested text from the CIRCO dataset. Finally, given the multimodal query, our demo shows the retrieved images. If the user is not satisfied with any of the results, they can refine their search by selecting a retrieved image as the new reference image for a subsequent query. This iterative process mimics a dialog-based search system, enabling a more natural and precise retrieval process.
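
The interaction loop described above (reference image, textual modification, ranked results, then a retrieved image reused as the next reference) can be sketched in a few lines. This is only an illustrative sketch under loud assumptions: the 3-D vectors are made-up stand-ins for real image and text embeddings, the function names (`compose_query`, `retrieve`, `refine`) are hypothetical, and the query fusion is a plain sum of unit vectors. It is not the SEARLE method, which the source describes only as the approach the demo relies on.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def compose_query(image_vec, text_vec):
    """Placeholder fusion (assumption): sum of unit vectors, NOT SEARLE."""
    summed = [i + t for i, t in zip(normalize(image_vec), normalize(text_vec))]
    return normalize(summed)

def retrieve(query_vec, gallery, k=3, exclude=None):
    """Return the names of the k gallery items most similar to the query."""
    candidates = [name for name in gallery if name != exclude]
    candidates.sort(key=lambda name: cosine(query_vec, gallery[name]), reverse=True)
    return candidates[:k]

def refine(reference, feedback_vec, gallery, k=3):
    """One feedback round: fuse reference image and text, then rank the gallery."""
    query = compose_query(gallery[reference], feedback_vec)
    return retrieve(query, gallery, k=k, exclude=reference)

# Toy "embeddings" (invented for illustration); axes loosely mean (dog, beach, night).
GALLERY = {
    "dog_park": [1.0, 0.0, 0.0],
    "dog_beach": [1.0, 1.0, 0.0],
    "dog_beach_night": [1.0, 1.0, 1.0],
    "city_night": [0.0, 0.0, 1.0],
}

# Round 1: the user picks a reference image and asks for "on a beach".
first = refine("dog_park", [0.0, 1.0, 0.0], GALLERY)
# Round 2: the top result becomes the new reference, refined with "at night".
second = refine(first[0], [0.0, 0.0, 1.0], GALLERY)
print(first, second)
```

In a real system the toy vectors would be replaced by embeddings from a vision-language model, and the fusion step by whatever composition the retrieval method defines; only the outer loop, in which a retrieved image is fed back as the next reference, mirrors the workflow shown in the video.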

Cited By

  • (2024) Evaluating Performance and Trends in Interactive Video Retrieval: Insights From the 12th VBS Competition. IEEE Access 12, 79342-79366. DOI: 10.1109/ACCESS.2024.3405638. Online publication date: 2024.

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clip
  2. composed image retrieval
  3. multimodal learning
  4. multimodal retrieval
  5. textual inversion
  6. vision and language

Qualifiers

  • Demonstration

Funding Sources

  • European Horizon 2020 Programme

Conference

MM '23
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada
