short-paper

Golden Retriever: A Real-Time Multi-Modal Text-Image Retrieval System with the Ability to Focus

Authors:

Florian Schneider,

Chris BiemannAuthors Info & Claims

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 3245 - 3250

https://doi.org/10.1145/3477495.3531666

Published: 07 July 2022 Publication History

Abstract

In this work, we present the Golden Retriever, a system leveraging state-of-the-art visio-linguistic models (VLMs) for real-time text-image retrieval. The unique feature of our system is that it can focus on words contained in the textual query, i.e., locate and high-light them within retrieved images. An efficient two-stage process implements real-time capability and the ability to focus. Therefore, we first drastically reduce the number of images processed by a VLM. Then, in the second stage, we rank the images and highlight the focussed word using the outputs of a VLM. Further, we introduce a new and efficient algorithm based on the idea of TF-IDF to retrieve images for short textual queries. One of multiple use cases where we employ the Golden Retriever is a language learner scenario, where visual cues for "difficult" words within sentences are provided to improve a user's reading comprehension. However, since the backend is completely decoupled from the frontend, the system can be integrated into any other application where images must be retrieved fast. We demonstrate the Golden Retriever with screenshots of a minimalistic user interface.

References

[1]

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition. Salt Lake City, UT, USA, 6077--6086.

[2]

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics (TACL), Vol. 5 (2017), 135--146.

[3]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33. Virtual, 1877--1901.

[4]

Daniel Cer, Mona Diab, Eneko Agirre, I nigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver, Canada, 1--14. https://doi.org/10.18653/v1/S17-2001

[5]

Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. In European Conference on Computer Vision (ECCV). Online, 104--120.

[6]

Paul Clough, Henning Müller, and Mark Sanderson. 2004. The CLEF 2004 cross-language image retrieval track. In Proceedings of the 5th conference on Cross-Language Evaluation Forum: multilingual Information Access for Text, Speech and Images. 597--613.

[7]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Minneapolis, MN, USA, 4171--4186.

[8]

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.

[9]

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://spacy.io/.

[10]

Johnson, Jeff and Douze, Matthijs and Jégou, Hervé. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019).

[11]

Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation (1972).

[12]

Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-Semantics Aligned Pre-training for Vision-and-Language Tasks. In European Conference on Computer Vision (ECCV). Online, 121--137.

[13]

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV). Zurich, Switzerland, 740--755.

[14]

Nicola Messina, Giuseppe Amato, Andrea Esuli, Fabrizio Falchi, Claudio Gennaro, and Stéphane Marchand-Maillet. 2021. Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 17, 4 (2021), 1--23.

Digital Library

[15]

Ajay Patel, Alexander Sands, Chris Callison-Burch, and Marianna Apidianaki. 2018. Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). Brussels, Belgium, 120.

[16]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning Transferable Visual Models from Natural Language Supervision. In International Conference on Machine Learning (ICML). Online, 8748--8763.

[17]

Reimers, Nils and Gurevych, Iryna. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China, 3973--3983.

[18]

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), Vol. 39, 6 (2016), 1137--1149.

Digital Library

[19]

Florian Schneider. 2021. Self-Supervised Multi-Modal Text-Image Retrieval Methods to Improve Human Reading. Master's thesis. University of Hamburg.

[20]

Florian Schneider, Özge Alacc am, Xintong Wang, and Chris Biemann. 2021. Towards Multi-Modal Text-Image Retrieval to improve Human Reading. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop. Mexico City, Mexico (online).

[21]

Mingxing Tan and Quoc Le. 2021. Efficientnetv2: Smaller models and faster training. In International Conference on Machine Learning (ICML). Online, 10096--10106.

[22]

Christopher Phillip Town. 2004. Ontology based Visual Information Processing. Ph.,D. Dissertation. University of Cambridge.

[23]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, Vol. 2 (2014), 67--78.

[24]

Zhou Yu, Jing Li, Tongan Luo, and Jun Yu. 2020. A PyTorch Implementation of Bottom-Up-Attention. https://github.com/MILVLG/bottom-up-attention.pytorch.

Cited By

Sheridan POnsjö M(2023)The hypergeometric test performs comparably to TF-IDF on standard text analysis tasksMultimedia Tools and Applications10.1007/s11042-023-16615-z83:10(28875-28890)Online publication date: 8-Sep-2023
https://doi.org/10.1007/s11042-023-16615-z

Index Terms

Golden Retriever: A Real-Time Multi-Modal Text-Image Retrieval System with the Ability to Focus
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Image search

Recommendations

Golden retriever: a Java based open source image retrieval engine
MM '13: Proceedings of the 21st ACM international conference on Multimedia

Golden Retriever Image Retrieval Engine (GRire) is an open source light weight Java library developed for Content Based Image Retrieval (CBIR) tasks, employing the Bag of Visual Words (BOVW) model. It provides a complete framework for creating CBIR ...
Golden Retriever: Question Retrieval System
ICHI '15: Proceedings of the 2015 International Conference on Healthcare Informatics

Duplicate questions get posted on Q&A online forums because users may not be aware of similar questions. Our proposed system, Golden Retriever, can recommend existing questions that are semantically related to incoming questions. Compared with other ...
SHREC’22 track: Open-Set 3D Object Retrieval▪
Abstract
This paper reports the results of the SHREC’22 track: Open-Set 3D Object Retrieval, the goal of which is to evaluate the performance of different retrieval algorithms under the Open-Set setting and modality-missing setting, ...
Graphical abstract

Display Omitted
Highlights
- This paper reports the results of the SHREC’22 track: Open-Set 3D Object Retrieval.

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2022

3569 pages

ISBN:9781450387323

DOI:10.1145/3477495

General Chairs:
Enrique Amigo
UNED
,
Pablo Castells
UAM and Amazon
,
Julio Gonzalo
UNED
,
Program Chairs:
Ben Carterette
Spotify
,
J. Shane Culpepper
RMIT University
,
Gabriella Kazai
Waseda University

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 July 2022

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Short-paper

Funding Sources

Deutsche Forschungsgemeinschaft

Conference

SIGIR '22

Sponsor:

SIGIR

SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 11 - 15, 2022

Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
264
Total Downloads

Downloads (Last 12 months)29
Downloads (Last 6 weeks)2

Reflects downloads up to 28 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Sheridan POnsjö M(2023)The hypergeometric test performs comparably to TF-IDF on standard text analysis tasksMultimedia Tools and Applications10.1007/s11042-023-16615-z83:10(28875-28890)Online publication date: 8-Sep-2023
https://doi.org/10.1007/s11042-023-16615-z

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten