demonstration

Zero-Shot Image Retrieval with Human Feedback

Authors:

Lorenzo Agnolucci,

Alberto Baldrati,

Marco Bertini,

Alberto Del Bimbo

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 9417 - 9419

https://doi.org/10.1145/3581783.3612664

Published: 27 October 2023

Abstract

Composed image retrieval extends traditional content-based image retrieval (CBIR) by combining a query image with additional descriptive text that expresses the user's intent and specifies supplementary requests about the visual attributes of the query image. This approach holds significant potential for e-commerce applications, such as interactive multimodal searches and chatbots. In our demo, we present an interactive composed image retrieval system based on the SEARLE approach, which tackles this task efficiently and effectively in a zero-shot manner. The demo allows users to perform image retrieval and iteratively refine the results using textual feedback.

Supplemental Material

MP4 File

In this video, we show the interface of our interactive Composed Image Retrieval (CIR) system, which allows users to perform image retrieval on an open-domain dataset by iteratively refining the results using textual feedback. Our demo relies on the SEARLE approach, the current state of the art for zero-shot CIR. Thanks to SEARLE, users can upload their own images without restrictions on domain or source. First, the user either uploads a reference image or chooses one of the randomly selected examples we provide. Then, they request a modification through arbitrary textual input or choose the suggested one from the CIRCO dataset. Finally, given the multimodal query, our demo shows the retrieved images. If the user is not satisfied with any of the results, they can refine their search by selecting a retrieved image as the new reference image for a subsequent query. This iterative process mimics a dialog-based search system, enabling a more natural and precise retrieval process.
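
The interaction loop described above (reference image, textual feedback, retrieval, then a retrieved image becoming the next reference) can be illustrated with a minimal sketch. This is not the SEARLE method: the `compose_query` function is a hypothetical placeholder that simply sums normalized feature vectors, the three-dimensional features and image names are invented toy data, and the "user selection" is simulated by always taking the top result.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def compose_query(image_feat, text_feat):
    # ASSUMPTION: placeholder fusion (sum of unit vectors), not SEARLE's approach.
    a, b = normalize(image_feat), normalize(text_feat)
    return normalize([x + y for x, y in zip(a, b)])

def retrieve(query, gallery, k):
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]),
                    reverse=True)
    return ranked[:k]

def refine(gallery, reference, feedback_feats, k=2):
    """Run one retrieval round per feedback vector.

    The top result of each round is used as the next reference image,
    standing in for the user's manual selection in the demo.
    """
    history = []
    for text_feat in feedback_feats:
        query = compose_query(gallery[reference], text_feat)
        # Retrieve one extra item so the reference itself can be dropped.
        results = [n for n in retrieve(query, gallery, k + 1) if n != reference][:k]
        history.append(results)
        reference = results[0]
    return history

# Toy features with hypothetical axes: (dog, beach, snow).
gallery = {
    "dog_park": [1.0, 0.0, 0.0],
    "dog_beach": [1.0, 1.0, 0.0],
    "dog_snow": [1.0, 0.0, 1.0],
    "beach_only": [0.0, 1.0, 0.0],
}
feedback = [[0.0, 1.0, 0.0],   # e.g. "on a beach"
            [0.0, 0.0, 1.0]]   # e.g. "add snow"
history = refine(gallery, "dog_park", feedback)
print(history)
```

In a real system the feature vectors would come from a vision-language encoder and the fusion step would be replaced by the actual composition method; only the loop structure is meant to carry over.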

Index Terms

Zero-Shot Image Retrieval with Human Feedback
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Visual content-based indexing and retrieval
2. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Image search

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Demonstration

Funding Sources

European Horizon 2020 Programme

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada
