DOI: 10.1145/3503161.3548316
Research article

Visual Grounding in Remote Sensing Images

Published: 10 October 2022

Abstract

Retrieving ground objects from large-scale remote sensing images is important for many applications. We present the novel problem of visual grounding in remote sensing images. Visual grounding aims to locate particular objects (in the form of a bounding box or segmentation mask) in an image given a natural language expression. The task already exists in the computer vision community, but existing benchmark datasets and methods focus mainly on natural images rather than remote sensing images. Compared with natural images, remote sensing images contain large-scale scenes and the geographical spatial information of ground objects (e.g., longitude, latitude). Existing methods cannot handle these challenges. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. The proposed method consists of a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relations and represents a complex expression as a geospatial relation graph. The image encoder learns large-scale remote sensing scenes with adaptive region attention. The fusion module fuses the text and image features for visual grounding. We evaluate the proposed method against state-of-the-art methods on RSVG. Experiments show that our method outperforms previous methods on the proposed dataset. https://sunyuxi.github.io/publication/GeoVG
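
The abstract specifies a three-module architecture but no implementation details. As a concrete illustration only, here is a minimal PyTorch sketch of such a pipeline; every class name, layer choice, and dimension is an assumption of this sketch rather than the authors' released code, and a plain spatial softmax stands in for the paper's adaptive region attention while an LSTM stands in for its geospatial relation-graph encoder.

```python
# Hypothetical GeoVG-style pipeline: language encoder + image encoder +
# fusion head. All names and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    """Embeds the expression; stands in for the geospatial
    relation-graph encoder described in the paper."""
    def __init__(self, vocab_size=10000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, token_ids):                    # (B, T) int tokens
        _, (h, _) = self.rnn(self.embed(token_ids))
        return h[-1]                                 # (B, dim)

class ImageEncoder(nn.Module):
    """Tiny CNN plus a softmax over spatial positions, standing in
    for adaptive region attention over a large-scale scene."""
    def __init__(self, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.attn = nn.Conv2d(dim, 1, 1)

    def forward(self, img):                          # (B, 3, H, W)
        f = self.backbone(img)                       # (B, dim, h, w)
        a = torch.softmax(self.attn(f).flatten(2), dim=-1)  # (B, 1, h*w)
        return (f.flatten(2) * a).sum(-1)            # attention-pooled (B, dim)

class FusionHead(nn.Module):
    """Concatenates text and image features and regresses a
    normalized (cx, cy, w, h) bounding box."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 4), nn.Sigmoid())

    def forward(self, txt, img):
        return self.mlp(torch.cat([txt, img], dim=-1))

# Toy forward pass on random inputs.
txt = LanguageEncoder()(torch.randint(0, 10000, (2, 12)))
box = FusionHead()(txt, ImageEncoder()(torch.randn(2, 3, 512, 512)))
print(box.shape)  # torch.Size([2, 4])
```

The sketch only fixes the data flow the abstract describes: expression and image are encoded separately, then fused into a single box prediction; a real system would swap in the paper's relation-graph encoder and a pretrained detection backbone.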

Supplementary Material

MP4 File (MM22-fp2485.mp4)
We propose a novel problem of visual grounding in remote sensing (RS) images. Existing benchmark datasets mainly focus on natural images rather than RS images. Compared with natural images, RS images contain large-scale scenes and the geographical spatial information of ground objects (e.g., latitude). Existing methods cannot handle these challenges. In this paper, we collect a new visual grounding dataset, called RSVG, and design a new method, namely GeoVG. Our method contains a language encoder, an image encoder, and a fusion module. The language encoder learns numerical geospatial relations and represents a complex expression as a geospatial relation graph. The image encoder learns large-scale RS scenes with adaptive region attention. The fusion module fuses the text and image features. We evaluate our method against state-of-the-art methods on RSVG. Experiments show that our method outperforms previous methods on the proposed dataset.
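
To make the "geospatial relation graph" idea tangible, the toy parser below turns an expression with numeric attributes into nodes and relation edges. It is my own construction under stated assumptions (a fixed relation vocabulary, a crude regex tokenizer, and a small stop-word list), not the paper's parsing pipeline.

```python
# Toy geospatial relation-graph builder; illustrative only.
import re

def build_relation_graph(expression):
    """Nodes are content words with any attached numbers (e.g. a latitude
    value); edges link the two nouns around a spatial relation word."""
    relations = {"left", "right", "above", "below", "near",
                 "north", "south", "east", "west"}
    stopwords = {"the", "a", "an", "of", "at", "is", "to"}
    nouns, edges, pending = [], [], None
    for tok in re.findall(r"[a-z]+|\d+\.?\d*", expression.lower()):
        if tok in relations:
            pending = tok                        # remember relation word
        elif tok[0].isdigit():
            if nouns:
                nouns[-1][1].append(float(tok))  # numeric attribute of last node
        elif tok not in stopwords:
            nouns.append((tok, []))
            if pending and len(nouns) > 1:       # close the pending edge
                edges.append((nouns[-2][0], pending, tok))
                pending = None
    return nouns, edges

print(build_relation_graph("the airplane near the terminal at latitude 36.1"))
# ([('airplane', []), ('terminal', []), ('latitude', [36.1])],
#  [('airplane', 'near', 'terminal')])
```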

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataset
  2. object retrieval
  3. referring expression
  4. remote sensing
  5. visual grounding

Qualifiers

  • Research-article

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months)372
  • Downloads (Last 6 weeks)39
Reflects downloads up to 05 Mar 2025

Cited By

  • Efficient Grounding DINO: Efficient Cross-Modality Fusion and Efficient Label Assignment for Visual Grounding in Remote Sensing. IEEE Transactions on Geoscience and Remote Sensing 63 (2025), 1-14. DOI: 10.1109/TGRS.2025.3536015
  • View-Based Knowledge-Augmented Multimodal Semantic Understanding for Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 63 (2025), 1-33. DOI: 10.1109/TGRS.2025.3532349
  • Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation. IEEE Transactions on Geoscience and Remote Sensing 63 (2025), 1-11. DOI: 10.1109/TGRS.2024.3522293
  • Visual grounding of remote sensing images with multi-dimensional semantic-guidance. Pattern Recognition Letters (January 2025). DOI: 10.1016/j.patrec.2025.01.013
  • SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS Journal of Photogrammetry and Remote Sensing 221 (2025), 64-77. DOI: 10.1016/j.isprsjprs.2025.01.020
  • MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-13. DOI: 10.1109/TGRS.2024.3497976
  • A Regionally Indicated Visual Grounding Network for Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-11. DOI: 10.1109/TGRS.2024.3490847
  • A Spatial-Frequency Fusion Strategy Based on Linguistic Query Refinement for RSVG. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-13. DOI: 10.1109/TGRS.2024.3471082
  • Injecting Linguistic Into Visual Backbone: Query-Aware Multimodal Fusion Network for Remote Sensing Visual Grounding. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-14. DOI: 10.1109/TGRS.2024.3450303
  • Language Query-Based Transformer With Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1-13. DOI: 10.1109/TGRS.2024.3407598
