Abstract:
With the increasing popularity of remote sensing applications, some emergency scenarios, such as earthquake rescue, require rapid retrieval of remote sensing images. Owing to the efficiency of voice input, researchers have focused on cross-modal remote sensing image-voice retrieval methods. However, these methods have two major drawbacks: the speech input lacks discriminative power, and intra-modal semantic information is underutilized. To address these drawbacks, we propose a novel cross-modal feature fusion retrieval model. Our model learns a better-optimized cross-modal common feature space than previous models and thus improves retrieval performance. First, it augments the audio features with additional textual keyword information for remote sensing image retrieval. Second, it introduces inter-modality adversarial learning and intra-modality semantic discrimination into the remote sensing image-voice retrieval task. We conducted experiments on two datasets derived from the UCM-Captions dataset and the Remote Sensing Image Caption Dataset. The experimental results show that our model outperforms state-of-the-art models on this task.
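For illustration only, the following is a minimal PyTorch-style sketch of the two ideas the abstract names: fusing textual keyword features with audio features, and inter-modality adversarial learning over the common feature space. All module names, dimensions, and losses below are assumptions for the sketch, not the authors' implementation.

import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Projects concatenated audio and keyword features into a common space."""
    def __init__(self, audio_dim=512, keyword_dim=300, common_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim + keyword_dim, common_dim),
            nn.ReLU(),
            nn.Linear(common_dim, common_dim),
        )

    def forward(self, audio_feat, keyword_feat):
        # Fuse speech features with textual keyword features by concatenation.
        return self.proj(torch.cat([audio_feat, keyword_feat], dim=-1))

class ModalityDiscriminator(nn.Module):
    """Adversary that tries to tell image embeddings from audio embeddings."""
    def __init__(self, common_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(common_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, z):
        return self.net(z)

# Toy usage with random placeholder features: the encoder is trained to fool
# the discriminator so that image and audio embeddings in the common space
# become modality-invariant.
fusion = FusionEncoder()
disc = ModalityDiscriminator()
bce = nn.BCEWithLogitsLoss()

audio = torch.randn(8, 512)     # placeholder audio features
keywords = torch.randn(8, 300)  # placeholder keyword embeddings
image_z = torch.randn(8, 256)   # placeholder image embeddings in common space

audio_z = fusion(audio, keywords)
logits = disc(torch.cat([image_z, audio_z], dim=0))
labels = torch.cat([torch.ones(8, 1), torch.zeros(8, 1)], dim=0)
d_loss = bce(logits, labels)                    # discriminator objective
g_loss = bce(disc(audio_z), torch.ones(8, 1))   # encoder tries to fool it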
Date of Conference: 11-16 July 2021
Date Added to IEEE Xplore: 12 October 2021