research-article
DOI: 10.1145/3664647.3681270

Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning

Published: 28 October 2024

Abstract

A large body of research addresses Remote Sensing Image-Text Retrieval (RSITR), which aims to retrieve the corresponding targets for a given query. Within this line of work, transferring Foundation Models (FMs) such as CLIP to the remote sensing domain shows promising results. However, existing FM-based approaches neglect the negative impact of weakly correlated sample pairs and the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs. To address these challenges, we propose a novel Eliminate Before Align strategy with Keyword Explicit Reasoning framework (EBAKER) for RSITR. Specifically, we devise an Eliminate Before Align (EBA) strategy that filters out weakly correlated sample pairs to mitigate their deviation from the optimal embedding space during alignment. Moreover, we introduce a Keyword Explicit Reasoning (KER) module to exploit subtle differences in key concepts. Without bells and whistles, our method achieves a one-step transformation from FM to the RSITR task, obviating the need for extra pretraining on remote sensing data. Extensive experiments on three popular benchmark datasets validate that the proposed EBAKER method outperforms state-of-the-art methods with less training data. Our source code will be released soon.
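
The abstract does not state how weakly correlated pairs are identified or which alignment loss is used, so the following is only an illustrative sketch of the general "eliminate, then align" idea and not the EBAKER implementation. It assumes, hypothetically, that a pair is eliminated when the cosine similarity of its image and text embeddings falls below a fixed threshold, and that alignment uses a symmetric InfoNCE-style contrastive loss. The function names, the threshold of 0.2, and the temperature of 0.07 are all placeholders chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def eliminate_weak_pairs(img_embs, txt_embs, threshold=0.2):
    """Return indices of pairs whose image-text similarity reaches the threshold.

    Assumption: a fixed cosine threshold stands in for whatever criterion
    the paper actually uses to detect weakly correlated pairs.
    """
    return [
        i
        for i, (img, txt) in enumerate(zip(img_embs, txt_embs))
        if cosine(img, txt) >= threshold
    ]


def symmetric_info_nce(img_embs, txt_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i is the positive for row/column i."""
    n = len(img_embs)
    logits = [
        [cosine(img_embs[i], txt_embs[j]) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(row, target):
        # Numerically stable log-sum-exp minus the positive logit.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        return lse - row[target]

    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (image_to_text + text_to_image)


def eliminate_then_align(img_embs, txt_embs, threshold=0.2, temperature=0.07):
    """Drop weak pairs first, then compute the alignment loss on the rest."""
    keep = eliminate_weak_pairs(img_embs, txt_embs, threshold)
    if not keep:
        return keep, 0.0
    loss = symmetric_info_nce(
        [img_embs[i] for i in keep], [txt_embs[i] for i in keep], temperature
    )
    return keep, loss


if __name__ == "__main__":
    # Toy 2-D embeddings: the third pair is orthogonal, hence weakly correlated.
    images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    texts = [[1.0, 0.1], [0.1, 1.0], [-1.0, 1.0]]
    kept, loss = eliminate_then_align(images, texts)
    print("kept pair indices:", kept)
    print("alignment loss on kept pairs:", loss)
```

In this toy setup the orthogonal third pair never enters the contrastive loss, so it cannot pull the remaining embeddings toward a poor alignment. In practice the embeddings would come from a pretrained model such as CLIP rather than hand-written vectors.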

Cited By

  • (2025) Hierarchical and complementary experts transformer with momentum invariance for image-text retrieval. Knowledge-Based Systems 309 (112912). DOI: 10.1016/j.knosys.2024.112912. Online publication date: Jan-2025
  • (2025) Multi-task classification network for few-shot learning. International Journal of Multimedia Information Retrieval 14:1. DOI: 10.1007/s13735-025-00354-y. Online publication date: 17-Feb-2025

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. foundation model
    2. image-text retrieval
    3. keyword explicit reasoning
    4. remote sensing

    Qualifiers

    • Research-article

    Conference

    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 177
    • Downloads (last 6 weeks): 83
    Reflects downloads up to 01 Mar 2025.
