Knowledge-Based Systems

Volume 203, 5 September 2020, 105920

Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning

https://doi.org/10.1016/j.knosys.2020.105920

Highlights

  • Introducing a VAE to regularize the shared encoder and extract image features more effectively by reconstructing the input images.

  • Significantly improving image captioning performance by exploiting low-level and high-level image features simultaneously.

  • Enhancing the final text description quality by adding self-attention to spatial features.

  • Our proposed model outperforms state-of-the-art models on remote sensing image captioning.

Abstract

Image captioning, i.e., generating a natural-language description of a given image, is an essential task for machines to understand image content. Remote sensing image captioning is a branch of this field. Most current remote sensing image captioning models suffer from overfitting and fail to exploit the semantic information in images. To this end, we propose a Variational Autoencoder and Reinforcement Learning based Two-stage Multi-task Learning Model (VRTMM) for the remote sensing image captioning task. In the first stage, we finetune the CNN jointly with a Variational Autoencoder. In the second stage, a Transformer generates the text description using both spatial and semantic features. Reinforcement Learning is then applied to enhance the quality of the generated sentences. Our model surpasses the previous state-of-the-art results by a large margin on all seven scores on the Remote Sensing Image Captioning Dataset (RSICD). The experimental results indicate that our model is effective for remote sensing image captioning and achieves a new state-of-the-art result.

Introduction

Recently, there has been extensive study and analysis of high-resolution remote sensing images, and deep neural networks have achieved satisfactory results in scene classification and object detection. Despite the successful application of deep neural networks to the aforementioned tasks, it should be pointed out that existing research usually attaches more importance to the visual features of remote sensing images. Limited work has been done on capturing the semantic meaning of, and correlations among, different objects in remote sensing images, which is also a key issue for machines to understand the images better.

In this paper, we focus on the remote sensing image captioning task, which generates semantic descriptions by teaching a machine to comprehend the content of the image. In past years, only limited effort has been devoted to the text description of remote sensing images. Liu et al. [1] applied a semantic mining method in a remote sensing image retrieval model. Zhu et al. [2] proposed SAL-LDA (Semantic Allocation Level-Latent Dirichlet Allocation), a new strategy based on the semantic distribution. Yang et al. [3] modeled the underlying relations between features and the context in a given image with Conditional Random Field (CRF) theory. Wang and Zhou [4] explored a strategy using semantic information to retrieve remote sensing images from a dataset. Chen et al. [5] proposed to use graph model theory to extract object semantic relations. Li [6] presented an object detection-based semantic model by comparing different themes across categories at the semantic level. These approaches are limited in their ability to fully utilize the image content and to generate natural, fluent text descriptions. Deep neural networks with the encoder–decoder framework have proven successful at natural image captioning, and Reinforcement Learning [7] is also gradually being applied to image captioning.

Inspired by work on natural image captioning, several studies have been published on remote sensing image captioning. Qu et al. [8] employed an RNN as the decoder of a multi-modal model to describe the content of remote sensing images. Shi and Zou [9] proposed a remote sensing image captioning model that first leverages a convolutional neural network (CNN). Lu et al. [10] released the Remote Sensing Image Captioning Dataset (RSICD) and performed several experiments on it with different methods, including multi-modal models and attention-based models, to validate their performance. Wang et al. [11] measured the representations of images and captions by embedding them into the same semantic space. Zhang et al. [12] introduced an attribute attention mechanism into their model, which better captures the correspondence between semantic information and specific objects in the remote sensing image.

However, there are still some limitations on these approaches:

  • 1.

    Following transfer learning practice, the CNNs adopted by the models above are pre-trained on the ImageNet dataset to enhance their image feature extraction ability. However, compared with the natural images in ImageNet, most remote sensing images lack salient objects that attract attention. Because of the unique "view of God" (bird's-eye) perspective of remote sensing images, many objects are equally important and need to be considered simultaneously. Directly applying a CNN pre-trained on ImageNet as the encoder for remote sensing image captioning may therefore not perform well, due to the gap between remote sensing images and natural images. On the other hand, the ImageNet dataset is designed for image classification. Compared with image classification, it is more important for image captioning models to encode complete image information as well as the correlations between the objects in the image.

  • 2.

    The RNN precludes parallelization within training examples due to its inherently sequential nature [13], making it difficult to train. The Transformer [13], which models sequence dependencies entirely with the attention mechanism and thus removes recurrence, has been shown to be superior to the RNN in both feature extraction ability and training efficiency. Zhu et al. [14] utilized the Transformer as the decoder of a natural image captioning model, but little work has investigated its use for remote sensing image captioning.

  • 3.

    Reinforcement Learning (RL) has achieved great success in natural image captioning by bridging the gap between the training loss and the evaluation metrics (a sketch of the typical self-critical policy gradient is given after this list). However, how to further enhance the performance of remote sensing image captioning via RL is still under-explored.
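
As an illustration of how RL bridges this gap, self-critical sequence training (Rennie et al., listed in the references) directly optimizes a sentence-level metric such as CIDEr with a policy gradient of the form

    \nabla_\theta L(\theta) \approx -\left( r(w^s) - r(\hat{w}) \right) \nabla_\theta \log p_\theta(w^s),

where w^s is a caption sampled from the model, \hat{w} is the greedy-decoded baseline caption, and r(\cdot) is the evaluation metric used as the reward. This is one common formulation and not necessarily the exact objective adopted in this paper.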

The main purpose of this paper is to overcome the above-mentioned limitations. Our motivations and main contributions are as follows:

  • 1.

    Introducing a VAE to regularize the shared encoder and extract image features more effectively by reconstructing the input images. A VAE [15] can be regarded as an autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties for generating new data. Adding a VAE branch relieves the overfitting caused by the scarcity of remote sensing images. Furthermore, the reconstruction process in the VAE helps the CNN pre-trained on ImageNet encode better representations of the given remote sensing image (a minimal VAE sketch is given after this list).

  • 2.

    Significantly improving image captioning performance by exploiting low-level and high-level image features simultaneously. Zeiler and Fergus [16] visualized the different layers of a CNN and found that high-level features contain more semantic information, while low-level features focus more on details. Taking advantage of both high-level and low-level features is more effective because they complement each other, as illustrated in the second sketch after this list.

  • 3.

    Enhancing the final text description quality by adding self-attention to the spatial features. Vaswani et al. [13] introduced the self-attention mechanism, computed from vectors named Query, Key, and Value: the Query and Key construct the pairwise relationships, and the Values are aggregated according to those relationships, so that each output position summarizes its relations to all other positions in the input. Since high-level features focus more on semantic information, different spatial features are semantic representations of different areas of the image. The self-attention mechanism can thus be used to obtain better regional semantic representations by drawing information from related regions of the image (see the second sketch after this list).
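
The following is a minimal PyTorch sketch of a VAE branch of the kind described in contribution 1. The layer sizes, the flattened-image reconstruction target, and the loss weighting beta are illustrative assumptions, not the configuration used in the paper:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAEBranch(nn.Module):
        """Toy VAE head attached to a shared CNN encoder (sizes are illustrative)."""
        def __init__(self, feat_dim=2048, latent_dim=128, image_dim=224 * 224 * 3):
            super().__init__()
            self.fc_mu = nn.Linear(feat_dim, latent_dim)      # mean of q(z|x)
            self.fc_logvar = nn.Linear(feat_dim, latent_dim)  # log-variance of q(z|x)
            self.decoder = nn.Sequential(                     # reconstructs a flattened image
                nn.Linear(latent_dim, 1024), nn.ReLU(),
                nn.Linear(1024, image_dim), nn.Sigmoid(),
            )

        def forward(self, feat):                              # feat: (B, feat_dim) from the shared CNN
            mu, logvar = self.fc_mu(feat), self.fc_logvar(feat)
            # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return self.decoder(z), mu, logvar

    def vae_loss(recon, images, mu, logvar, beta=1.0):
        # Reconstruction term plus KL divergence between q(z|x) and N(0, I)
        recon_term = F.mse_loss(recon, images.flatten(1), reduction="mean")
        kl_term = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_term + beta * kl_term

During finetuning, a loss of this form would simply be added to the scene classification loss so that the shared encoder is regularized by the reconstruction objective.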
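
The second sketch illustrates contributions 2 and 3: extracting a high-level spatial feature grid and a pooled semantic vector from a CNN backbone, then refining the spatial features with scaled dot-product self-attention. The ResNet-101 backbone, tensor shapes, and single-head attention are simplifying assumptions for illustration only:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet101

    class FeatureEncoder(nn.Module):
        """Extracts a spatial feature grid plus a global semantic vector and
        applies single-head self-attention over the spatial positions."""
        def __init__(self, d_model=512):
            super().__init__()
            backbone = resnet101(pretrained=True)             # ImageNet-pretrained backbone
            self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
            self.proj = nn.Linear(2048, d_model)
            self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))

        def forward(self, images):                            # images: (B, 3, 224, 224)
            fmap = self.cnn(images)                           # (B, 2048, 7, 7) spatial grid
            spatial = fmap.flatten(2).transpose(1, 2)         # (B, 49, 2048), one vector per region
            semantic = spatial.mean(dim=1)                    # (B, 2048) pooled global descriptor
            x = self.proj(spatial)                            # (B, 49, d_model)
            # Scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V
            q, k, v = self.q(x), self.k(x), self.v(x)
            attn = F.softmax(q @ k.transpose(1, 2) / (x.size(-1) ** 0.5), dim=-1)
            attended = attn @ v                               # each region attends to related regions
            return attended, semantic                         # inputs to the caption decoder

Each row of attn weights one image region against all others, so regions with related semantics can reinforce one another before the features are passed to the decoder.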

Our paper is organized as follows. In Section 2, we review related work on natural image captioning and remote sensing image captioning. In Section 3, we present the methods we propose for remote sensing image captioning. In Section 4, we report our experimental settings and analyze the results. In Section 5, we conclude the paper.

Section snippets

Related work

There have been extensive studies and analyses of high-resolution remote sensing images [17], [18]. Tasks on remote sensing images usually stem from their natural-image counterparts, e.g., the image captioning task. There are three categories of methods for natural image captioning: retrieval-based methods, template-based methods, and encoder–decoder based methods. The retrieval-based methods [19], [20], [21] first search the dataset for the image most similar to the input image and obtain

VRTMM

In Section 3.1, we introduce the overall framework of our model. In Section 3.2, the details of the encoder in our model are presented. In Section 3.3, we first briefly describe the overall architecture of the Transformer and then introduce the modifications we make to adapt it to the task at hand. In Section 3.4, we introduce the training details of the finetuning procedure.

Dataset

The finetuning of the encoder is performed on the NWPU-RESISC45 dataset [46], a publicly available dataset for the REmote Sensing Image Scene Classification (RESISC) task. It contains 31,500 images across 45 scene classes, with 700 images per class. We conduct the image captioning experiment on the RSICD dataset [10], the largest remote sensing image captioning dataset so far. The RSICD dataset includes 10,921 remote sensing images of size 224 × 224 in

Conclusions

In this paper, we propose a new model for remote sensing image captioning based on the variational autoencoder and the encoder–decoder architecture. We first finetune the CNN with the variational autoencoder branch on the remote sensing image scene classification dataset. The finetuned CNN is then employed to extract both semantic and spatial features of the images. After the self-attention operation on spatial features, both semantic and spatial features are passed to the modified Transformer

CRediT authorship contribution statement

Xiangqing Shen: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration, Funding acquisition. Bing Liu: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration, Funding

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61801198), Natural Science Foundation of Jiangsu Province, China (BK20180174), Fundamental Research Funds for the Central Universities, China (2017XKQY082), National Natural Science Foundation of China (61806206), Natural Science Foundation of Jiangsu Province, China (BK20180639). The authors would like to thank the anonymous reviewers and the associate editor for their valuable comments.

References (50)

  • Zhu, X., et al., Captioning transformer with stacked attention modules, Appl. Sci. (2018)
  • Basaeed, E., et al., Supervised remote sensing image segmentation using boosted convolutional neural networks, Knowl.-Based Syst. (2016)
  • Mylonas, S.K., et al., GeneSIS: A GA-based fuzzy segmentation algorithm for remote sensing images, Knowl.-Based Syst. (2013)
  • Liu, T., et al., A remote sensing image retrieval model based on semantic mining, Geomatics Inf. Sci. Wuhan Univ. (2009)
  • Zhu, Q.Q., et al., Multi-feature probability topic scene classifier for high spatial resolution remote sensing imagery
  • Yang, J., Jiang, Z., Zhou, Q., Zhang, H., Shi, J., Remote sensing image semantic labeling based on conditional random field, ...
  • Wang, J., et al., Research on key technologies of remote sensing image data retrieval based on semantics, Comput. Digit. Eng. (2012)
  • Chen, K., Zhou, Z., Guo, J., Zhang, D., Sun, X., Semantic scene understanding oriented high resolution remote sensing image ...
  • Li, Y., Target Detection Method of High Resolution Remote Sensing Image Based on Semantic Model (2012)
  • Rennie, S.J., et al., Self-critical sequence training for image captioning
  • Qu, B., et al., Deep semantic understanding of high resolution remote sensing image
  • Shi, Z.W., et al., Can a machine generate humanlike language descriptions for a remote sensing image?, IEEE Trans. Geosci. Remote Sens. (2017)
  • Lu, X.X., et al., Exploring models and data for remote sensing image caption generation, IEEE Trans. Geosci. Remote Sens. (2018)
  • Wang, B., et al., Semantic descriptions of high-resolution remote sensing images, IEEE Geosci. Remote Sens. Lett. (2019)
  • Zhang, X.R., et al., Description generation for remote sensing images using attribute attention mechanism, Remote Sens. (2019)
  • Vaswani, A., et al., Attention is all you need
  • Kingma, D.P., et al., Auto-encoding variational Bayes
  • Zeiler, M., et al., Visualizing and understanding convolutional neural networks
  • Gong, Y., et al., Improving image-sentence embeddings using large weakly annotated photo collections
  • Sun, C., et al., Automatic concept discovery from parallel text and visual corpora
  • Hodosh, M., et al., Framing image description as a ranking task: Data, models and evaluation metrics, J. Artificial Intelligence Res. (2013)
  • Farhadi, A., et al., Every picture tells a story: Generating sentences from images
  • Li, S., et al., Composing simple image descriptions using web-scale n-grams
  • Kulkarni, G., et al., Baby talk: Understanding and generating simple image descriptions
  • Mao, J., et al., Deep captioning with multimodal recurrent neural networks (m-RNN)