Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning

Authors:

Jungong Han

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 1662 - 1671

https://doi.org/10.1145/3664647.3681270

Published: 28 October 2024

Abstract

A large body of research addresses Remote Sensing Image-Text Retrieval (RSITR), which aims to retrieve the corresponding targets for a given query. Within this work, transferring Foundation Models (FMs) such as CLIP to the remote sensing domain shows promising results. However, existing FM-based approaches neglect both the negative impact of weakly correlated sample pairs and the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose a novel Eliminate Before Align strategy with Keyword Explicit Reasoning framework (EBAKER) for RSITR. Specifically, we devise an Eliminate Before Align (EBA) strategy that filters out weakly correlated sample pairs to mitigate their deviations from the optimal embedding space during alignment. Moreover, we introduce a Keyword Explicit Reasoning (KER) module so that subtle differences in key concepts play a positive role. Without bells and whistles, our method achieves a one-step transformation from FM to the RSITR task, obviating the need for extra pretraining on remote sensing data. Extensive experiments on three popular benchmark datasets validate that the proposed EBAKER method outperforms state-of-the-art methods with less training data. Our source code will be released soon.
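The abstract does not say how weakly correlated pairs are identified or how alignment is computed, so the following is only an illustrative sketch of the general "filter, then align" idea, not the EBAKER implementation. It assumes that pair correlation is measured by cosine similarity between precomputed image and text embeddings, that pairs below a fixed threshold are dropped, and that alignment uses an image-to-text InfoNCE loss. The function names, the threshold value and the temperature are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def eliminate_weak_pairs(image_embs, text_embs, threshold):
    """Return indices of pairs whose similarity is at least `threshold`.

    Assumption: a fixed cosine threshold stands in for whatever
    criterion the paper actually uses to detect weak pairs.
    """
    return [
        i for i, (img, txt) in enumerate(zip(image_embs, text_embs))
        if cosine(img, txt) >= threshold
    ]

def info_nce(image_embs, text_embs, temperature=0.07):
    """Image-to-text InfoNCE loss averaged over the batch."""
    losses = []
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy batch: the third pair is orthogonal, i.e. weakly correlated.
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [1.0, -1.0]]

kept = eliminate_weak_pairs(image_embs, text_embs, threshold=0.5)
loss = info_nce([image_embs[i] for i in kept], [text_embs[i] for i in kept])
print(kept, loss)
```

In this toy batch the orthogonal third pair is removed before the loss is computed, so it can no longer pull the remaining embeddings away from their matched texts.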

Cited By

Zhang Y, Ji Z, Pang Y, Han J. (2025). Hierarchical and complementary experts transformer with momentum invariance for image-text retrieval. Knowledge-Based Systems 309 (112912). Online publication date: Jan-2025.
https://doi.org/10.1016/j.knosys.2024.112912
Ji Z, Liu Y, Wang X, Liu J, Cao J, Yu Y. (2025). Multi-task classification network for few-shot learning. International Journal of Multimedia Information Retrieval 14:1. Online publication date: 17-Feb-2025.
https://doi.org/10.1007/s13735-025-00354-y

Index Terms

Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

October 2024

11719 pages

ISBN: 9798400706868

DOI: 10.1145/3664647

General Chairs:
Jianfei Cai (Monash University, Australia)
Mohan Kankanhalli (NUS, Singapore)
Balakrishnan Prabhakaran (UT Dallas, USA)
Susanne Boll (University of Oldenburg, Germany)

Program Chairs:
Ramanathan Subramanian (University of Canberra & IIT Ropar, Australia)
Liang Zheng (Australian National University, Australia)
Vivek K. Singh (Rutgers University, USA)
Pablo Cesar (Centrum Wiskunde & Informatica, Netherlands)
Lexing Xie (Australian National University, Australia)
Dong Xu (University of Hong Kong, Hong Kong)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MM '24

Sponsor:

SIGMM

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
