DOI: 10.1145/3626772.3657741
Research article · Open access

CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora

Published: 11 July 2024

Abstract

Text-to-image retrieval aims to find relevant images for a text query and is important in many use cases, such as digital libraries, e-commerce, and multimedia databases. Although Multimodal Large Language Models (MLLMs) demonstrate state-of-the-art performance, they struggle with large-scale, diverse, and ambiguous real-world retrieval needs because of their computational cost and the injective embeddings they produce. This paper presents a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework designed for fast and effective large-scale long-text-to-image retrieval. The first stage, Entity-based Ranking (ER), handles long-text query ambiguity by employing a multiple-queries-to-multiple-targets paradigm and filters candidates for the next stage. The second stage, Summary-based Re-ranking (SR), refines these rankings using summarized queries. We also propose a specialized Decoupling-BEiT-3 encoder, optimized for handling ambiguous user needs and for both stages, which further improves computational efficiency through vector-based similarity inference. Evaluation on the AToMiC dataset shows that CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000 while reducing training and retrieval times by 68.75% and 99.79%, respectively. We will release our code at https://github.com/longkukuhi/CFIR to facilitate future research.
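
To make the two-stage design concrete, the following minimal sketch (Python with NumPy) illustrates one way a coarse-to-fine, index-shared retrieval could be assembled: entity embeddings extracted from the long query coarsely rank a shared image index (Entity-based Ranking), and a single summary embedding re-ranks only the surviving candidates (Summary-based Re-ranking). The function name, the max-over-entities scoring rule, and the top-1000 cutoff are illustrative assumptions rather than the paper's actual implementation, which relies on the proposed Decoupling-BEiT-3 encoder to produce the embeddings.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Normalize vectors so that dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def coarse_to_fine_retrieval(entity_embs, summary_emb, image_index, top_k=1000):
    """Hypothetical two-stage retrieval over a single shared image index.

    entity_embs : (n_entities, d) embeddings of entities from the long query
    summary_emb : (d,)            embedding of a summary of the long query
    image_index : (n_images, d)   precomputed image embeddings shared by both stages
    """
    images = l2_normalize(image_index)

    # Stage 1 -- Entity-based Ranking (ER): score each image by its best-matching
    # entity (a multiple-queries-to-multiple-targets style of coarse matching),
    # then keep only the top_k candidates.
    entities = l2_normalize(entity_embs)
    er_scores = (entities @ images.T).max(axis=0)      # (n_images,)
    candidates = np.argsort(-er_scores)[:top_k]

    # Stage 2 -- Summary-based Re-ranking (SR): re-score only the candidates with
    # the summary embedding; inference stays a cheap vector similarity computation.
    summary = l2_normalize(summary_emb[None, :])[0]
    sr_scores = images[candidates] @ summary           # (top_k,)
    return candidates[np.argsort(-sr_scores)]

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
ranking = coarse_to_fine_retrieval(
    entity_embs=rng.normal(size=(5, 64)),
    summary_emb=rng.normal(size=64),
    image_index=rng.normal(size=(10_000, 64)),
)
print(ranking[:10])

Because both stages score queries against the same precomputed image matrix, only the query-side embeddings change between stages, which is one reading of the "index-shared" design and of how vector-based similarity inference keeps retrieval fast.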

Published In

SIGIR '24: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2024
3164 pages
ISBN: 9798400704314
DOI: 10.1145/3626772
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. coarse-to-fine retrieval
  2. document-to-image retrieval
  3. multi-modal large language model
  4. text-to-image retrieval

Conference

SIGIR 2024

Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
