Knowledge-Based Systems

Volume 203, 5 September 2020, 105920

Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning

https://doi.org/10.1016/j.knosys.2020.105920

Highlights

  • Introducing a VAE to regularize the shared encoder and extract image features more effectively by reconstructing the input images.

  • Significantly improving image captioning performance by exploiting low-level and high-level image features simultaneously.

  • Enhancing the final text description quality by adding self-attention to spatial features.

  • Our proposed model outperforms state-of-the-art models on remote sensing image captioning.

Abstract

Image captioning, i.e., generating a natural-language description of a given image, is an essential task for machines to understand image content. Remote sensing image captioning is a branch of this field. Most current remote sensing image captioning models suffer from overfitting and fail to exploit the semantic information in images. To this end, we propose a Variational Autoencoder and Reinforcement Learning based Two-stage Multi-task Learning Model (VRTMM) for the remote sensing image captioning task. In the first stage, we finetune the CNN jointly with a Variational Autoencoder. In the second stage, a Transformer generates the text description using both spatial and semantic features. Reinforcement Learning is then applied to enhance the quality of the generated sentences. Our model surpasses the previous state-of-the-art results by a large margin on all seven scores on the Remote Sensing Image Captioning Dataset (RSICD). The experimental results indicate that our model is effective for remote sensing image captioning and achieves a new state-of-the-art result.

Introduction

Recently, there has been extensive study and analysis of high-resolution remote sensing images, and deep neural networks have achieved satisfactory results in scene classification and object detection. Despite the successful application of deep neural networks to the aforementioned tasks, it should be pointed out that existing research usually attaches more importance to the visual features of remote sensing images. Limited work has been done on capturing the semantic meaning of, and correlations among, different objects in remote sensing images, which is also a key issue for machines to understand the images better.

In this paper, we focus on the remote sensing image captioning task, which generates semantic descriptions by teaching a machine to comprehend the content of the image. In past years, only limited effort has been devoted to the text description of remote sensing images. Liu et al. [1] applied a semantic mining method in a remote sensing image retrieval model. Zhu et al. [2] proposed SAL-LDA (Semantic Allocation Level-Latent Dirichlet Allocation), a new strategy based on the semantic distribution. Yang et al. [3] modeled the underlying relations between features and the context in a given image with Conditional Random Field (CRF) theory. Wang and Zhou [4] explored a strategy using semantic information to retrieve remote sensing images from a dataset. Chen et al. [5] proposed to use graph model theory to extract object semantic relations. Li [6] presented an object detection-based semantic model by comparing different themes across categories at the semantic level. These approaches are limited in their ability to fully utilize the image content and to generate natural, fluent text descriptions. Deep neural networks with the encoder–decoder framework have proven successful at natural image captioning, and Reinforcement Learning [7] is also gradually being applied to image captioning.

Inspired by work on natural image captioning, several studies have been published on remote sensing image captioning. Qu et al. [8] employed an RNN as the decoder of a multi-modal model to describe the content of remote sensing images. Shi and Zou [9] proposed a remote sensing image captioning model that first leverages a convolutional neural network (CNN). Lu et al. [10] released the Remote Sensing Image Captioning Dataset (RSICD) and performed several experiments on it with different methods, including multi-modal models and attention-based models, to validate their performance. Wang et al. [11] measured the representations of images and captions by embedding them into the same semantic space. Zhang et al. [12] introduced an attribute attention mechanism into their model, which better captures the correspondence between semantic information and specific objects in the remote sensing image.

However, there are still some limitations on these approaches:

  • 1.

    Following transfer learning practice, the CNNs adopted by the models above are pre-trained on the ImageNet dataset to enhance their image feature extraction ability. However, compared with the natural images in ImageNet, most remote sensing images lack salient objects that attract attention. Because of the unique "view of God" (bird's-eye) perspective of remote sensing images, many objects are equally important and need to be considered simultaneously. Directly applying a CNN pre-trained on ImageNet as the encoder for remote sensing image captioning may therefore not perform well, due to the gap between remote sensing images and natural images. On the other hand, the ImageNet dataset is designed for image classification. Compared with image classification, it is more important for image captioning models to encode complete image information as well as the correlations between the objects in the image.

  • 2.

    The RNN precludes parallelization within training examples due to its inherently sequential nature [13], making it difficult to train. The Transformer [13], which models sequence dependencies entirely with the attention mechanism and thus removes recurrence, has been shown to be superior to the RNN in both feature extraction ability and training efficiency. Zhu et al. [14] utilized the Transformer as the decoder of a natural image captioning model, but little work has investigated its use for remote sensing image captioning.

  • 3.

    Reinforcement Learning (RL) has achieved great success in natural image captioning by bridging the gap between the training loss and the evaluation metrics (a sketch of the typical self-critical policy gradient is given after this list). However, how to further enhance the performance of remote sensing image captioning via RL is still under-explored.
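
As an illustration of how RL bridges this gap, self-critical sequence training (Rennie et al., listed in the references) directly optimizes a sentence-level metric such as CIDEr with a policy gradient of the form

    \nabla_\theta L(\theta) \approx -\left( r(w^s) - r(\hat{w}) \right) \nabla_\theta \log p_\theta(w^s),

where w^s is a caption sampled from the model, \hat{w} is the greedy-decoded baseline caption, and r(\cdot) is the evaluation metric used as the reward. This is one common formulation and not necessarily the exact objective adopted in this paper.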

The main purpose of this paper is to overcome the above-mentioned limitations. Our motivations and main contributions are as follows:

  • 1.

    Introducing a VAE to regularize the shared encoder and extract image features more effectively by reconstructing the input images. A VAE [15] can be regarded as an autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties for generating new data. Adding a VAE branch relieves the overfitting caused by the scarcity of remote sensing images. Furthermore, the reconstruction process in the VAE helps the CNN pre-trained on ImageNet encode better representations of the given remote sensing image (a minimal VAE sketch is given after this list).

  • 2.

    Significantly improving image captioning performance by exploiting low-level and high-level image features simultaneously. Zeiler and Fergus [16] visualized the different layers of a CNN and found that high-level features contain more semantic information, while low-level features focus more on details. Taking advantage of both high-level and low-level features is more effective because they complement each other, as illustrated in the second sketch after this list.

  • 3.

    Enhancing the final text description quality by adding self-attention to the spatial features. Vaswani et al. [13] introduced the self-attention mechanism, computed from vectors named Query, Key, and Value: the Query and Key construct the pairwise relationships, and the Values are aggregated according to those relationships, so that each output position summarizes its relations to all other positions in the input. Since high-level features focus more on semantic information, different spatial features are semantic representations of different areas of the image. The self-attention mechanism can thus be used to obtain better regional semantic representations by drawing information from related regions of the image (see the second sketch after this list).
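
The following is a minimal PyTorch sketch of a VAE branch of the kind described in contribution 1. The layer sizes, the flattened-image reconstruction target, and the loss weighting beta are illustrative assumptions, not the configuration used in the paper:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAEBranch(nn.Module):
        """Toy VAE head attached to a shared CNN encoder (sizes are illustrative)."""
        def __init__(self, feat_dim=2048, latent_dim=128, image_dim=224 * 224 * 3):
            super().__init__()
            self.fc_mu = nn.Linear(feat_dim, latent_dim)      # mean of q(z|x)
            self.fc_logvar = nn.Linear(feat_dim, latent_dim)  # log-variance of q(z|x)
            self.decoder = nn.Sequential(                     # reconstructs a flattened image
                nn.Linear(latent_dim, 1024), nn.ReLU(),
                nn.Linear(1024, image_dim), nn.Sigmoid(),
            )

        def forward(self, feat):                              # feat: (B, feat_dim) from the shared CNN
            mu, logvar = self.fc_mu(feat), self.fc_logvar(feat)
            # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return self.decoder(z), mu, logvar

    def vae_loss(recon, images, mu, logvar, beta=1.0):
        # Reconstruction term plus KL divergence between q(z|x) and N(0, I)
        recon_term = F.mse_loss(recon, images.flatten(1), reduction="mean")
        kl_term = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon_term + beta * kl_term

During finetuning, a loss of this form would simply be added to the scene classification loss so that the shared encoder is regularized by the reconstruction objective.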
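
The second sketch illustrates contributions 2 and 3: extracting a high-level spatial feature grid and a pooled semantic vector from a CNN backbone, then refining the spatial features with scaled dot-product self-attention. The ResNet-101 backbone, tensor shapes, and single-head attention are simplifying assumptions for illustration only:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet101

    class FeatureEncoder(nn.Module):
        """Extracts a spatial feature grid plus a global semantic vector and
        applies single-head self-attention over the spatial positions."""
        def __init__(self, d_model=512):
            super().__init__()
            backbone = resnet101(pretrained=True)             # ImageNet-pretrained backbone
            self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool and fc
            self.proj = nn.Linear(2048, d_model)
            self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))

        def forward(self, images):                            # images: (B, 3, 224, 224)
            fmap = self.cnn(images)                           # (B, 2048, 7, 7) spatial grid
            spatial = fmap.flatten(2).transpose(1, 2)         # (B, 49, 2048), one vector per region
            semantic = spatial.mean(dim=1)                    # (B, 2048) pooled global descriptor
            x = self.proj(spatial)                            # (B, 49, d_model)
            # Scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V
            q, k, v = self.q(x), self.k(x), self.v(x)
            attn = F.softmax(q @ k.transpose(1, 2) / (x.size(-1) ** 0.5), dim=-1)
            attended = attn @ v                               # each region attends to related regions
            return attended, semantic                         # inputs to the caption decoder

Each row of attn weights one image region against all others, so regions with related semantics can reinforce one another before the features are passed to the decoder.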

Our paper is organized as follows. In Section 2, we review related work on natural image captioning and remote sensing image captioning. In Section 3, we present the methods we propose for remote sensing image captioning. In Section 4, we report our experimental settings and analyze the results. In Section 5, we conclude the paper.

Section snippets

Related work

There have been extensive studies and analyses of high-resolution remote sensing images [17], [18]. Tasks on remote sensing images usually stem from their natural-image counterparts, e.g., the image captioning task. There are three categories of methods for natural image captioning: retrieval-based methods, template-based methods, and encoder–decoder based methods. The retrieval-based methods [19], [20], [21] first search the dataset for the image most similar to the input image and obtain

VRTMM

In Section 3.1, we introduce the overall framework of our model. In Section 3.2, the details of the encoder in our model are presented. In Section 3.3, we first briefly describe the overall architecture of the Transformer and then introduce the modifications we make to adapt it to the task at hand. In Section 3.4, we introduce the training details of the finetuning procedure.

Dataset

The finetuning of the encoder is performed on the NWPU-RESISC45 dataset [46], a publicly available dataset for the REmote Sensing Image Scene Classification (RESISC) task. It contains 31,500 images across 45 scene classes, with 700 images per class. We conduct the image captioning experiment on the RSICD dataset [10], the largest remote sensing image captioning dataset so far. The RSICD dataset includes 10,921 remote sensing images of size 224 × 224 in

Conclusions

In this paper, we propose a new model for remote sensing image captioning based on the variational autoencoder and the encoder–decoder architecture. We first finetune the CNN with the variational autoencoder branch on the remote sensing image scene classification dataset. The finetuned CNN is then employed to extract both semantic and spatial features of the images. After the self-attention operation on spatial features, both semantic and spatial features are passed to the modified Transformer

CRediT authorship contribution statement

Xiangqing Shen: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration, Funding acquisition. Bing Liu: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision, Project administration, Funding

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61801198), Natural Science Foundation of Jiangsu Province, China (BK20180174), Fundamental Research Funds for the Central Universities, China (2017XKQY082), National Natural Science Foundation of China (61806206), Natural Science Foundation of Jiangsu Province, China (BK20180639). The authors would like to thank the anonymous reviewers and the associate editor for their valuable comments.

References (50)

  • Zhu, X., et al., Captioning transformer with stacked attention modules, Appl. Sci. (2018)
  • Basaeed, E., et al., Supervised remote sensing image segmentation using boosted convolutional neural networks, Knowl.-Based Syst. (2016)
  • Mylonas, S.K., et al., GeneSIS: A GA-based fuzzy segmentation algorithm for remote sensing images, Knowl.-Based Syst. (2013)
  • Liu, T., et al., A remote sensing image retrieval model based on semantic mining, Geomatics Inf. Sci. Wuhan Univ. (2009)
  • Zhu, Q.Q., et al., Multi-feature probability topic scene classifier for high spatial resolution remote sensing imagery
  • Yang, J., Jiang, Z., Zhou, Q., Zhang, H., Shi, J., Remote sensing image semantic labeling based on conditional random field, ...
  • Wang, J., et al., Research on key technologies of remote sensing image data retrieval based on semantics, Comput. Digit. Eng. (2012)
  • Chen, K., Zhou, Z., Guo, J., Zhang, D., Sun, X., Semantic scene understanding oriented high resolution remote sensing image ...
  • Li, Y., Target Detection Method of High Resolution Remote Sensing Image Based on Semantic Model (2012)
  • Rennie, S.J., et al., Self-critical sequence training for image captioning
  • Qu, B., et al., Deep semantic understanding of high resolution remote sensing image
  • Shi, Z.W., et al., Can a machine generate humanlike language descriptions for a remote sensing image?, IEEE Trans. Geosci. Remote Sens. (2017)
  • Lu, X.X., et al., Exploring models and data for remote sensing image caption generation, IEEE Trans. Geosci. Remote Sens. (2018)
  • Wang, B., et al., Semantic descriptions of high-resolution remote sensing images, IEEE Geosci. Remote Sens. Lett. (2019)
  • Zhang, X.R., et al., Description generation for remote sensing images using attribute attention mechanism, Remote Sens. (2019)
  • Vaswani, A., et al., Attention is all you need
  • Kingma, D.P., et al., Auto-encoding variational Bayes
  • Zeiler, M., et al., Visualizing and understanding convolutional neural networks
  • Gong, Y., et al., Improving image-sentence embeddings using large weakly annotated photo collections
  • Sun, C., et al., Automatic concept discovery from parallel text and visual corpora
  • Hodosh, M., et al., Framing image description as a ranking task: Data, models and evaluation metrics, J. Artificial Intelligence Res. (2013)
  • Farhadi, A., et al., Every picture tells a story: Generating sentences from images
  • Li, S., et al., Composing simple image descriptions using web-scale n-grams
  • Kulkarni, G., et al., Baby talk: Understanding and generating simple image descriptions
  • Mao, J., et al., Deep captioning with multimodal recurrent neural networks (m-RNN)