1 Introduction

Image description can be used in many real application scenarios, e.g., video retrieval and automatic video subtitling. It is a challenging task because it lies at the intersection of computer vision, natural language processing, and other disciplines. In recent years, image description generation has attracted considerable research attention, and great progress has been made in labeling images with a pre-defined closed set of visual categories, as shown in Fig. 1.

Fig. 1. Image description of dense regions.

Image description generation has been thoroughly studied by the computer vision community, and several well-established models have been developed. A majority of the proposed algorithms describe an image with a single semantic sentence. Everingham et al. [1] focused on labeling images with a fixed set of visual categories, while Kulkarni et al. [2] relied on hard-coded visual concepts and sentence templates to generate sentence descriptions. Both methods ignore the location of dense objects in images. Recent works [3,4,5] use Recurrent Neural Networks to generate dense image descriptions together with the corresponding location information, spliced together according to a trained sentence model. However, these simplified sentence models generate single smooth sentences from a fixed set of visual categories without a rich vocabulary, and therefore fail to cover the rich underlying semantics of images.

To address the above-mentioned issue, we propose a novel approach for image description generation based on the cooperation between a Convolutional Neural Network (CNN) and an improved Wasserstein Generative Adversarial Network (WGAN). When the CNN extracts visual features, an alignment objective is used to avoid the deviation of regions from words. We describe the transformations that map every image and sentence into a set of vectors in a common h-dimensional space. After matching dense objects to words, the words with higher scores are more likely to be used to generate the sentence. The improved WGAN is an adversarial training mechanism trained jointly by a structured sentence generator and multi-level sentence discriminators. The discriminator learns to distinguish between real sentences and sentences synthesized by the generator. The visual features with higher scores are used as the input of the improved WGAN to generate sentences.

The rest of the paper is organized as follows. Section 2 overviews the related works. In Sect. 3, we elaborate our method. The experimental evaluation is given in Sect. 4, and we draw conclusions in Sect. 5.

2 Related Work

Image description generation is an active area of research on its own [5]. Traditional methods concentrated on feature extraction. Deep models are developing rapidly because of their accuracy and applicability, and they generally fall into dense object recognition, visual sentence generation, and visual paragraph generation. Generating high-scoring sentences plays an important role in paragraph generation. In this section, we briefly review works related to sentence generation.

Kiros et al. [4] proposed a model in which a CNN is used for feature extraction and object detection. Borrowing this feature-extraction approach, Karpathy et al. [3] modeled a Recurrent Neural Network framework that learns a joint image-label embedding to characterize the semantic label dependency as well as the image-label relevance. However, a plain RNN is ill-suited to sentence generation because it cannot capture long-range temporal structure. An RNN with Long Short-Term Memory (LSTM) can effectively model the long-term temporal dependency in a sequence [6]. To reason about long-term linguistic structure across multiple sentences, hierarchical recurrent networks have been widely used to directly model the hierarchy of language. In [8, 9], a framework based on CNN-LSTM was developed for image description, and the generated descriptions significantly outperform the baselines. However, LSTM-based models use a fixed set of visual categories without a rich vocabulary, and thus fail to cover the rich underlying semantics of images.

Dai et al. [15] proposed a Conditional GAN (CGAN) to generate sentences, in a cooperation model between a native CNN and the CGAN. The CGAN overcomes the difficulty of back-propagating through discrete words via Policy Gradient, a strategy stemming from Reinforcement Learning, which allows the generator to receive early feedback along the way. Liang et al. [10] proposed the Recurrent Topic-Transition GAN (RTT-GAN) to generate paragraphs; it is composed of LSTM models and generates diverse and semantically coherent sentences by reasoning over both local semantic regions and the global sentence context. The generator selectively incorporates visual and language cues of semantic regions to produce each sentence. Nevertheless, to the best of our knowledge, these previous methods fail to consider the alignment between dense objects and words, which has a great impact on the accuracy of the chosen words.

3 Our Method

The framework of our algorithm is shown in Fig. 2. The red box illustrates the flowchart of the RCNN model, and the green box details the procedure of sentence generation. In our algorithm, we first generate descriptions for candidate image regions with the RCNN. Subsequently, the sentence generator produces meaningful sentences by selectively incorporating fine-grained visual and textual cues. To ensure sentence quality, we apply a sentence discriminator to each generated sentence to measure its plausibility and the smoothness of the semantic transition from preceding sentences. The generator and discriminator are learned jointly within an adversarial framework.

Fig. 2. The framework of our algorithm.

3.1 Learning to Align Visual and Language Data

Image-Feature Extraction. Following prior works, we observe that sentence descriptions make frequent references to objects and their attributes. Therefore, we adopt an RCNN to detect objects, following [3]. The feature vector for each detected region is computed as:

$$\begin{aligned} v = W_M \left[ \mathrm {CNN}_{\theta _c}\left( I_p \right) \right] . \end{aligned}$$
(1)

where \(\theta _c\) denotes the CNN parameter set and the matrix \(W_M\) has dimension \(h \times 4096\) (h is the size of the multimodal embedding space). Every sentence is represented as a sequence of N words, each encoded as a representation vector, and every image is represented by a set of h-dimensional region vectors \(\{v_i \mid i = 1, \ldots , 20\}\).
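To make Eq. 1 concrete, the following is a minimal sketch of the region-feature projection, assuming 4096-dimensional CNN features per region; the embedding size h and all names here are illustrative assumptions rather than the exact values used in our implementation.

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Sketch of Eq. (1): v = W_M [CNN_{theta_c}(I_p)], projecting 4096-d
    region features into the h-dimensional multimodal space."""
    def __init__(self, h=1000, cnn_dim=4096):  # h is a hypothetical value
        super().__init__()
        self.W_M = nn.Linear(cnn_dim, h, bias=False)  # W_M has shape h x 4096

    def forward(self, region_feats):
        # region_feats: (num_regions, 4096), e.g. the whole image plus 19 regions
        return self.W_M(region_feats)                 # (num_regions, h)
```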

Objects Alignment. We formulate an image-sentence score as a function of the individual region-word scores, using a Bidirectional Recurrent Neural Network (BRNN) to compute the word representations. Following the model of Karpathy et al. [3], we interpret the dot product \(v_i^T s_t\) between the i-th region and the t-th word as a measure of similarity and use it to define the score between image k and sentence l as Eq. 2:

$$\begin{aligned} S_{kl} = \sum \limits _{t \in g_l} \sum \limits _{i \in g_k} \max (0, v_i^T s_t) . \end{aligned}$$
(2)

where \(g_k\) denotes the set of image fragments (regions) in image k, \(g_l\) denotes the set of word fragments in sentence l, and \(s_t\) is a function of all words in the entire sentence.

Every word \(s_t\) aligns to the single best image region. As we show in the experiments, this simplified model also leads to improvements in the final ranking performance. Assuming that \(l = i\) denotes a corresponding image-sentence pair, the final max-margin structured loss is:

$$\begin{aligned} \mathcal {C}(\theta ) = \sum \limits _i \Big [ \sum \limits _{l(\mathrm {images})} \max (0, S_{il} - S_{ii} + 1) + \sum \limits _{l(\mathrm {sentences})} \max (0, S_{li} - S_{ii} + 1) \Big ] . \end{aligned}$$
(3)

This objective encourages aligned image-sentence pairs to have a higher score than misaligned pairs, by a margin. The highest-scoring region-word pairs are then used to form the word vectors.
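A minimal sketch of the region-word score of Eq. 2 and the max-margin ranking loss of Eq. 3 is shown below; the tensor shapes and function names are illustrative assumptions.

```python
import torch

def image_sentence_score(v, s):
    """Eq. (2): v holds region vectors (R, h), s holds word vectors (T, h)."""
    sims = v @ s.t()                        # (R, T) dot products v_i^T s_t
    return torch.clamp(sims, min=0).sum()   # sum of max(0, v_i^T s_t)

def ranking_loss(scores):
    """Eq. (3): scores is an (N, N) matrix with scores[i, l] = S_il,
    where the diagonal holds the aligned image-sentence pairs."""
    n = scores.size(0)
    diag = scores.diag().view(n, 1)
    cost_sent = torch.clamp(scores - diag + 1, min=0)      # rank sentences per image
    cost_img  = torch.clamp(scores - diag.t() + 1, min=0)  # rank images per sentence
    off_diag = 1.0 - torch.eye(n)                          # exclude aligned pairs
    return ((cost_sent + cost_img) * off_diag).sum()
```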

3.2 Sentence Generator

The architecture of the generator G is shown in Fig. 3. It recurrently retains different levels of context states within a hierarchy constructed by a sentence LSTM, a word LSTM, and two attention modules. First, the visual attention module selectively focuses on semantic regions and generates the visual representation of the sentence from the visual vectors, from which the sentence LSTM encodes a topic vector for a new sentence. Second, the language attention module attends to local phrases of the focused semantic regions to facilitate word generation by the word LSTM, while learning to incorporate linguistic knowledge.

LSTM for Sentence Description. The sentence LSTM is a single-layer model. During training, it takes the image pixels I and a sequence of input vectors \((v_1, \ldots , v_T)\) from the region descriptions, and computes a sequence of hidden states \((h_1, \ldots , h_T)\) and a sequence of outputs \((y_1, \ldots , y_T)\) by iterating the following recurrence from t = 1 to T. An attention mechanism is applied over the visual features V of all semantic regions, resulting in a visual context vector \(f_v^t\) that represents the next sentence at the t-th step:

$$\begin{aligned} y_t = \mathrm {softmax}\{ W_{oh}\, f(W_{hx} x_t + W_{hh} h_{t-1} + b_h + W_{hi}[\mathrm {CNN}_{\theta _c}(I)]) + b_o \} \end{aligned}$$
(4)

where \(W_{hi}\), \(W_{hx}\), \(W_{hh}\), \(W_{oh}\), \(b_h\), and \(b_o\) are learnable parameters, \(x_t\) is the input vector at step t, and \(\mathrm {CNN}_{\theta _c}(I)\) is the output of the last layer of the network. \(y_t\) is the output vector, which holds the log probabilities of the words in the dictionary; the word with the highest score is selected.
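As an illustration of the recurrence in Eq. 4, a minimal step might look as follows; the parameter dictionary and the choice of ReLU for the activation f are assumptions made for the sketch.

```python
import torch
import torch.nn.functional as F

def rnn_step(x_t, h_prev, img_feat, params):
    """One step of Eq. (4). `params` holds W_hx, W_hh, W_hi, W_oh, b_h, b_o."""
    h_t = torch.relu(params["W_hx"] @ x_t
                     + params["W_hh"] @ h_prev
                     + params["W_hi"] @ img_feat      # CNN_{theta_c}(I)
                     + params["b_h"])
    y_t = F.log_softmax(params["W_oh"] @ h_t + params["b_o"], dim=-1)
    return h_t, y_t   # y_t: log probabilities over the dictionary
```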

Word LSTM with Language Attention. To obtain plausible and smooth sentences, the model must recognize and describe substantial details such as objects, attributes, and relationships, so a word-replacement model appropriate for the context is needed. We selectively incorporate the embeddings of local phrases based on the topic vector and use the word LSTM to generate better word representations. Considering that each local phrase relates to a respective visual feature, we reuse the visual attentive weights to enhance the language attention, as sketched below. By computing the contribution of each word to the whole sentence in the hidden layer, the word LSTM embeds the words with the highest contributions into the sentence.
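The sketch below illustrates how the visual attentive weights could be reused to pool local-phrase embeddings before the word LSTM; the module layout and dimensions are illustrative assumptions rather than our exact architecture.

```python
import torch
import torch.nn as nn

class LanguageAttention(nn.Module):
    """Sketch: reuse visual attention weights over regions to pool the
    embeddings of their local phrases, then drive the word LSTM."""
    def __init__(self, h=512):  # h is a hypothetical hidden size
        super().__init__()
        self.word_lstm = nn.LSTMCell(2 * h, h)

    def forward(self, visual_weights, phrase_emb, topic, state):
        # visual_weights: (R,) attention weights from the visual module
        # phrase_emb:     (R, h) embeddings of the local phrase of each region
        lang_ctx = (visual_weights.unsqueeze(1) * phrase_emb).sum(dim=0)   # (h,)
        x = torch.cat([topic, lang_ctx], dim=-1).unsqueeze(0)              # (1, 2h)
        h_t, c_t = self.word_lstm(x, state)
        return h_t, (h_t, c_t)
```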

Fig. 3. The diagram of the generator.

3.3 Sentence Discriminator

The sentence discriminator \(D_s\) aims to distinguish between real sentences and synthesized ones based on the linguistic characteristics of a natural sentence description. In our algorithm, the discriminator \(D_s\) is an LSTM that recurrently takes the word embeddings within a sentence as input and produces a real-valued plausibility score for the synthesized sentence, evaluating the plausibility of each individual sentence.
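A minimal sketch of such an LSTM-based sentence discriminator is given below; the embedding and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SentenceDiscriminator(nn.Module):
    """Sketch of D_s: an LSTM over word embeddings that outputs a single
    real-valued plausibility score per sentence."""
    def __init__(self, emb_dim=300, hidden=512):  # hypothetical sizes
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, word_embs):               # (batch, T, emb_dim)
        _, (h_n, _) = self.lstm(word_embs)      # final hidden state
        return self.score(h_n[-1]).squeeze(-1)  # (batch,) plausibility scores
```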

The discrete nature of text samples blocks gradient back-propagation from the discriminators to the generator. To overcome this problem, we use a deterministic approximation. At each word-generation step, we select the most probable word according to the soft-max emission distribution and propagate gradients only through the continuous soft-max outputs. More simply, we apply a max-pooling operation over the soft-max to overcome gradient vanishing in the GAN model. Figure 4 shows the discriminator of the framework.
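One common deterministic relaxation of this idea, sketched below, feeds the discriminator a probability-weighted average of word embeddings instead of a sampled discrete word, so that gradients can flow back to the generator; this is a sketch of the approximation described above, not the exact operation used in our implementation.

```python
import torch

def soft_word_embedding(logits, embedding_matrix):
    """Replace the discrete word choice by a soft-max-weighted average of
    word embeddings, which stays differentiable w.r.t. the generator."""
    probs = torch.softmax(logits, dim=-1)   # (batch, vocab) emission distribution
    return probs @ embedding_matrix         # (batch, emb_dim), gradient-friendly
```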

Fig. 4. The diagram of the discriminator.

The sentence input (shown in Fig. 4 as Fake Sentences) is the output of the generator described in Sect. 3.2. The objective of the adversarial framework, following the WGAN [10] method of minimizing an approximated Wasserstein distance, is written as:

$$\begin{aligned} \min _{G} \max _{D^s} C(G, D^s) = E_{\hat{s} \sim S}[D^s(\hat{s})] - E_{\hat{s} \sim S_{1:t}}[D^s(\hat{s})] \end{aligned}$$
(5)

where S denotes the distribution of true sentences, constructed from a sentence description corpus, \(S_{1:t}\) denotes the distribution of sentences produced by the generator G, and \(\hat{s}\) is a sentence sampled from the respective distribution. The sentence discriminator \(D^s\) optimizes a critic separating real from fake sentences. The objective for the generator G is:

$$\begin{aligned} G^{'} = \arg \min _{G} \max _{D^s} \gamma \, C(G, D^s) \end{aligned}$$
(6)

where \(\gamma \) is a balancing parameter fixed at 0.001 in our implementation. The optimization is performed in an alternating min-max manner [10], as sketched below.
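A minimal sketch of one alternating update for Eqs. 5 and 6 follows; the model, optimizer, and data objects (gen, disc, g_opt, d_opt, images, real_embs) are assumed to be defined elsewhere, and the weight clipping follows the original WGAN recipe rather than our exact training schedule.

```python
def adversarial_step(gen, disc, images, real_embs, g_opt, d_opt, gamma=0.001):
    """One alternating min-max update for Eqs. (5)-(6) (illustrative sketch)."""
    # Critic step: maximize E[D(real)] - E[D(fake)].
    fake_embs = gen(images).detach()
    d_loss = -(disc(real_embs).mean() - disc(fake_embs).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    for p in disc.parameters():              # weight clipping as in WGAN
        p.data.clamp_(-0.01, 0.01)

    # Generator step: minimize gamma * C(G, D); only the fake term depends on G.
    g_loss = -gamma * disc(gen(images)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```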

4 Experiments

To validate the effectiveness of the proposed algorithm, we conduct experiments on the MSCOCO dataset, which contains 123,000 images, each annotated with five sentences collected via Amazon Mechanical Turk. Throughout all experiments, 5,000 images each are used for validation and testing.

To evaluate the performance of the LSTM model in generating sentences, we adopt VGGNet [11] in the full-image experiments and compute BLEU [12], METEOR [13], and CIDEr [14] scores. We evaluate each candidate sentence by matching it against the five reference sentences written by humans. The word generator can optimize individual vocabulary choices; this step does not improve the accuracy of the sentence, because the preceding steps have already generated the vocabulary and phrases, but after the vocabulary replacement we obtain a smoother sentence.

4.1 Objective Alignment Evaluation

The RCNN is pre-trained on ImageNet and fine-tuned on the 200 classes of the ImageNet Detection Challenge. In our algorithm, the top 19 ranked regions are selected. Image-sentence alignment evaluation is used to build the image-sentence ranking in our work: we retrieve the most compatible test sentence and visualize the highest-scoring region for each word together with the associated scores, as shown in Fig. 5.

Fig. 5. Example alignments predicted by our model.

All words are fed into the generator in order until the sentence described in Sect. 3.2 is generated. The word LSTM is recurrently forwarded to optimize the words iteratively. We report the median rank of the closest ground-truth result and Recall@K, i.e., the fraction of queries for which a correct sentence appears in the top K results. The results of these experiments are shown in Table 1 [3]; a sketch of the metric computation follows the table.

Table 1. Image-Sentence ranking experiment results.
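The ranking metrics can be computed from the image-sentence score matrix of Sect. 3.1; a minimal sketch, assuming the diagonal entries correspond to the ground-truth pairs, is:

```python
import numpy as np

def ranking_metrics(score_matrix, ks=(1, 5, 10)):
    """Recall@K and median rank from an (N, N) image-sentence score matrix
    whose diagonal holds the ground-truth pairs."""
    n = score_matrix.shape[0]
    order = np.argsort(-score_matrix, axis=1)              # best-first per image
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}  # Recall@K
    return recalls, float(np.median(ranks))                # median rank
```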

4.2 The Process of Adversarial Training

Without the adversarial loss, the generator in the WGAN can be regarded as a standard image captioning model, similar to Word-Concat. The discriminator compares the generated sentence with the closest ground-truth result from the corpus. This confirms that sentence plausibility is critical for generating long, convincing sentences.

Table 2 shows the evaluation of full-image predictions on 1,000 test images, where B-n denotes the BLEU score computed with up to n-grams. Its value always lies between 0 and 1 and indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts; a small sketch of this computation is given after the table.

Table 2. Evaluation of full image predictions on 1,000 test images.
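As a reference for how a B-n value in [0, 1] can be obtained, the following sketch scores one candidate sentence against its five human references using NLTK's BLEU implementation (the smoothing function is an assumption added to avoid zero scores on short sentences):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_n(candidate, references, n=4):
    """B-n in [0, 1]: BLEU with up to n-grams against the reference sentences."""
    weights = tuple(1.0 / n for _ in range(n))
    return sentence_bleu([r.split() for r in references],
                         candidate.split(),
                         weights=weights,
                         smoothing_function=SmoothingFunction().method1)
```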

We compare our method with others. Vinyals et al. [9] use an LSTM that is fed the image through a bias term on the first step. Karpathy et al. [3] use a Multimodal RNN, whose ensembled results are better. Donahue et al. [5] use a two-layer factored LSTM with GoogLeNet, a different CNN, and report results of a model ensemble. Other methods [8, 12] appear to work worse than ours, but this is likely in large part due to their use of the less powerful AlexNet [7] features. Compared to these approaches, our model generates more accurate descriptions.

4.3 Personalized Sentence Generation

Different from prior works, the proposed model supports personalized sentence generation, producing diverse descriptions conditioned on the first word. The generator can sequentially output diverse and topic-coherent sentences for an image. Given two different first words, the model produces two personalized sentences for the same image. We present qualitative results of our model in Fig. 6.

Fig. 6. Example sentences generated by the proposed model for test images.

4.4 Qualitative Analysis

The proposed model generates sensible descriptions of images, as shown in Fig. 6. The images shown in Fig. 6 with a blue background are described more accurately; the first prediction, "a large building with a clock tower on top of it", does not appear in the training set. However, "a large building", "a clock tower", and "on top of" do occur, and the model may have composed them to describe the first image. For the picture of the running car, however, the model cannot accurately determine the state of the car. In the last two pictures, there are some problems when judging the objects (especially their colors): the model cannot accurately determine the color of the horse and of the water. In future work, we will output phrases more accurately by adjusting the recognition within the bounding boxes.

In general, we find that a relatively large portion of the generated sentences can be found in the training data. The generation process can be repeated many times without any syntactic specification. Therefore, the generated sentences are smoother and more complete, and the generated words are relatively rich and appropriate.

5 Conclusion

In this paper, we propose a novel approach based on the cooperation between a CNN and the RTT-GAN. When the CNN extracts features, an image-sentence alignment evaluation is used to build the image-sentence ranking, and words with higher scores are selected to generate sentences. The RTT-GAN uses heuristic rules to generate smooth sentences with a rich vocabulary. We evaluated the performance on full-frame experiments and showed that our model outperforms retrieval baselines. As future work, we will attempt to update the RTT-GAN, improve the levels and types of discriminators to enhance the sentence recognition effect, and apply it to other applications such as paragraph generation.