DOI: 10.1145/3503161.3548316
Research article

Visual Grounding in Remote Sensing Images

Published: 10 October 2022

Abstract

Retrieving ground objects from large-scale remote sensing images is important for many applications. We present the novel problem of visual grounding in remote sensing images. Visual grounding aims to locate particular objects (in the form of a bounding box or segmentation mask) in an image given a natural language expression. The task already exists in the computer vision community, but existing benchmark datasets and methods focus mainly on natural images rather than remote sensing images. Compared with natural images, remote sensing images contain large-scale scenes and the geographical spatial information of ground objects (e.g., longitude, latitude). Existing methods cannot handle these challenges. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. The proposed method consists of a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relations and represents a complex expression as a geospatial relation graph. The image encoder learns large-scale remote sensing scenes with adaptive region attention. The fusion module fuses the text and image features for visual grounding. We evaluate the proposed method against state-of-the-art methods on RSVG. Experiments show that our method outperforms previous methods on the proposed dataset. https://sunyuxi.github.io/publication/GeoVG
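
The abstract specifies a three-module architecture but no implementation details. As a concrete illustration only, here is a minimal PyTorch sketch of such a pipeline; every class name, layer choice, and dimension is an assumption of this sketch rather than the authors' released code, and a plain spatial softmax stands in for the paper's adaptive region attention while an LSTM stands in for its geospatial relation-graph encoder.

```python
# Hypothetical GeoVG-style pipeline: language encoder + image encoder +
# fusion head. All names and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """Embeds the expression; stands in for the geospatial
    relation-graph encoder described in the paper."""
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, token_ids):                    # (B, T) int tokens
        _, (h, _) = self.rnn(self.embed(token_ids))
        return h[-1]                                 # (B, dim)

class ImageEncoder(nn.Module):
    """Tiny CNN plus a softmax over spatial positions, standing in
    for adaptive region attention over a large-scale scene."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.attn = nn.Conv2d(dim, 1, 1)

    def forward(self, img):                          # (B, 3, H, W)
        f = self.backbone(img)                       # (B, dim, h, w)
        a = torch.softmax(self.attn(f).flatten(2), dim=-1)  # (B, 1, h*w)
        return (f.flatten(2) * a).sum(-1)            # attention-pooled (B, dim)

class FusionHead(nn.Module):
    """Concatenates text and image features and regresses a
    normalized (cx, cy, w, h) bounding box."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, txt, img):
        return self.mlp(torch.cat([txt, img], dim=-1))

# Toy forward pass on random inputs.
txt = LanguageEncoder()(torch.randint(0, 10000, (2, 12)))
box = FusionHead()(txt, ImageEncoder()(torch.randn(2, 3, 512, 512)))
print(box.shape)  # torch.Size([2, 4])
```

The sketch only fixes the data flow the abstract describes: expression and image are encoded separately, then fused into a single box prediction; a real system would swap in the paper's relation-graph encoder and a pretrained detection backbone.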

Supplementary Material

MP4 File (MM22-fp2485.mp4)
We propose a novel problem of visual grounding in remote sensing (RS) images. Existing benchmark datasets mainly focus on natural images rather than RS images. Compared with natural images, RS images contain large-scale scenes and the geographical spatial information of ground objects (e.g., latitude). Existing methods cannot handle these challenges. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. Our method contains a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relations and represents a complex expression as a geospatial relation graph. The image encoder learns large-scale RS scenes with adaptive region attention. The fusion module fuses the text and image features. We evaluate our method against state-of-the-art methods on RSVG. Experiments show that our method outperforms previous methods on the proposed dataset.
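
To make the "geospatial relation graph" idea tangible, the toy parser below turns an expression with numeric attributes into nodes and relation edges. It is my own construction under stated assumptions (a fixed relation vocabulary, a crude regex tokenizer, and a small stop-word list), not the paper's parsing pipeline.

```python
# Toy geospatial relation-graph builder; illustrative only.
import re

def build_relation_graph(expression):
    """Nodes are content words with any attached numbers (e.g. a latitude
    value); edges link the two nouns around a spatial relation word."""
    relations = {"left", "right", "above", "below", "near",
                 "north", "south", "east", "west"}
    stopwords = {"the", "a", "an", "of", "at", "is", "to"}
    nouns, edges, pending = [], [], None
    for tok in re.findall(r"[a-z]+|\d+\.?\d*", expression.lower()):
        if tok in relations:
            pending = tok                        # remember relation word
        elif tok[0].isdigit():
            if nouns:
                nouns[-1][1].append(float(tok))  # numeric attribute of last node
        elif tok not in stopwords:
            nouns.append((tok, []))
            if pending and len(nouns) > 1:       # close the pending edge
                edges.append((nouns[-2][0], pending, tok))
                pending = None
    return nouns, edges

print(build_relation_graph("the airplane near the terminal at latitude 36.1"))
# ([('airplane', []), ('terminal', []), ('latitude', [36.1])],
#  [('airplane', 'near', 'terminal')])
```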

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataset
  2. object retrieval
  3. referring expression
  4. remote sensing
  5. visual grounding

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months)372
  • Downloads (Last 6 weeks)39
Reflects downloads up to 05 Mar 2025

Cited By

  • Efficient Grounding DINO: Efficient Cross-Modality Fusion and Efficient Label Assignment for Visual Grounding in Remote Sensing. IEEE Transactions on Geoscience and Remote Sensing 63 (2025), 1-14. DOI: 10.1109/TGRS.2025.3536015
  • View-Based Knowledge-Augmented Multimodal Semantic Understanding for Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 63 (2025), 1-33. DOI: 10.1109/TGRS.2025.3532349
  • Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation. IEEE Transactions on Geoscience and Remote Sensing 63 (2025), 1-11. DOI: 10.1109/TGRS.2024.3522293
  • Visual grounding of remote sensing images with multi-dimensional semantic-guidance. Pattern Recognition Letters (January 2025). DOI: 10.1016/j.patrec.2025.01.013
  • SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS Journal of Photogrammetry and Remote Sensing 221 (2025), 64-77. DOI: 10.1016/j.isprsjprs.2025.01.020
  • MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-13. DOI: 10.1109/TGRS.2024.3497976
  • A Regionally Indicated Visual Grounding Network for Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-11. DOI: 10.1109/TGRS.2024.3490847
  • A Spatial-Frequency Fusion Strategy Based on Linguistic Query Refinement for RSVG. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-13. DOI: 10.1109/TGRS.2024.3471082
  • Injecting Linguistic Into Visual Backbone: Query-Aware Multimodal Fusion Network for Remote Sensing Visual Grounding. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-14. DOI: 10.1109/TGRS.2024.3450303
  • Language Query-Based Transformer With Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-13. DOI: 10.1109/TGRS.2024.3407598
