
1 Introduction

Image captioning is a problem that integrates computer vision and natural language processing. The task is, given an image, to have the computer generate a reasonable sentence describing it. This task is easy for people but difficult for computers, because it requires the computer not only to recognize the objects and scene in the picture, but also to capture the relationships between objects, and between objects and the scene. Here, relationships include spatial location, attributes, actions and so on. It also requires a language model strong enough to generate reasonable sentences after fully understanding the picture; the generated sentences should be free of grammatical and logical errors. Image captioning can be used in many scenarios, for example to assist visually impaired people and to report incidents in surveillance. It can also be applied to robotic vision and multi-media search tasks.

Fig. 1. An example of image caption generation.

Figure 1 shows an example of attention-based image caption generation. Given an image, we first use a convolutional neural network to obtain a set of feature maps (from a lower layer) which preserve much of the visual information, then utilize an attention model to generate a visual representation from those feature maps, and finally use a multi-layer perceptron to decode this visual representation and generate the next word.

Most existing image caption language models mainly maximize the probability of the next word given the image and the previously generated words. This describes how the image and previous words influence the next word, and it guarantees the contextual relationships among the generated words. However, those models do not explore the relationships between the semantic and visual contents. As a result, the sentences generated by such approaches may be logically correct while the semantics (e.g. subjects, verbs or objects) in the sentences are wrong. For example, the sentence generated by an existing model for the image in Fig. 1 may be ‘a man is hitting a volleyball’, which is logically correct, but the subject ‘man’ (‘woman’ or ‘girl’ would be correct) is not relevant to the image contents.

In [12], the contextual relationships mentioned above are defined as coherence, while the relationships between the semantic and visual contents are defined as relevance. In [12], coherence and relevance were explored simultaneously for the video description task. For image captioning, however, most existing methods only explore coherence, while relevance has not been explored. Since video description is similar to image description, we first extend the work of [12] to the image caption task so that coherence and relevance are explored simultaneously. Moreover, our method builds on [21], against which we compare in order to validate the effectiveness of relevance.

2 Related Work

Recently, inspired by the successful use of sequence-to-sequence training with neural networks for machine translation, several methods based on deep neural networks have been proposed for the image caption task. The first to use neural networks for image caption generation were Kiros et al. [8], who used a multi-modal log-bilinear model. Mao et al. [11] proposed an approach similar to [8] for generating descriptions, but used a recurrent neural network as the language model instead of a feed-forward one. Similarly, Vinyals et al. [19] and Donahue et al. [3] used LSTMs as their language models.

All of the above works encode the input image as a single feature vector. Karpathy and Fei-Fei [6], in contrast, proposed a method that creates a joint embedding space to explore the similarity between semantic and visual contents, and then generates description sentences. Fang et al. [4] proposed a model incorporating object detections. This model divided caption generation into several parts: word detection through a CNN, caption candidates through a maximum entropy model, and sentence re-ranking through a deep multi-modal semantic model. Tran et al. [18] followed Fang’s work [4] and tried to address the challenges of describing images in the wild by adding an entity recognition model, which was used to identify celebrities and landmarks.

In [21], Xu et al. first added an attention mechanism to the encoder-decoder image caption model, conditioning the generation of the next word on attended image regions at each time step. Lu et al. [10] proposed an adaptive attention model via a visual sentinel, which learns to decide when and where to attend to the image for word generation at each time step.

Yao et al. [22] proposed a model named Long Short-Term Memory with Attributes (LSTM-A). LSTM-A introduced the concept of attributes and added them into a baseline image caption framework [19]. More recently, since the exposure bias [14] and non-differentiable task metric issues can be addressed by Reinforcement Learning (RL) [17], Ranzato et al. [14] used the REINFORCE algorithm [20] to directly optimize non-differentiable, sequence-based test metrics and overcome both issues. Rennie et al. [15] constructed a framework that uses a new optimization approach called self-critical sequence training (SCST); this framework currently obtains state-of-the-art results on the MS COCO evaluation server.

3 Model

Our goal is to generate good description sentences for images. In this section, we first describe the basic attention image caption model, and then we present a joint loss measuring the relevance and coherence simultaneously.

Fig. 2. An illustration of the attention-based model.

3.1 Word Embedding and Convolutional Networks

Our model takes a scaled image as input and generates a sentence S encoded as a sequence of words. We first encode each word as a one-hot vector \(\mathbf{{w}}_i\), whose dimension V is the vocabulary size.

$$\begin{aligned} \mathbf {S} = \{{\mathbf{{w}}_1},{\mathbf{{w}}_2}, \cdots ,{\mathbf{{w}}_L}\},{\mathbf{{w}}_i} \in {{\mathbf {R}}^{\mathbf{V}}}. \end{aligned}$$
(1)
$$\begin{aligned} {{\mathbf{{x}}^t} = \mathbf{{E}}{\mathbf{{w}}_t},t \in \{ 1,2,...,L\} }. \end{aligned}$$
(2)

where \(\mathbf{{E}} \in {\mathbf{{R}}^{m*V}}\) is an embedding matrix, m is the embedding dimensionality and L is the length of the sentence. Then we use a 2D convolutional neural network to extract a set of features. The network produces K vectors, each of which is a D-dimensional representation corresponding to a part of the input image.

$$\begin{aligned} \mathbf {I} = \{{\mathbf{{v}}_1},{\mathbf{{v}}_2}, \cdots ,{\mathbf{{v}}_K}\},{\mathbf{{v}}_i} \in {{\mathbf {R}}^\mathbf{D}}. \end{aligned}$$
(3)
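
To make the encoding concrete, the following NumPy sketch illustrates the one-hot encoding and embedding of Eqs. 1–2 and the feature set of Eq. 3. The sizes (V, m, K, D) and the random placeholder features are illustrative assumptions only; in practice the feature set comes from a convolutional layer of a pretrained CNN.

```python
import numpy as np

# Illustrative sizes (our assumption): vocabulary V, embedding dimension m,
# K image locations, D channels per location.
V, m, K, D = 10000, 512, 196, 512

rng = np.random.default_rng(0)
E = rng.normal(scale=0.01, size=(m, V))   # embedding matrix E in R^{m x V}

def embed(word_id):
    """Eqs. (1)-(2): one-hot encode a word w_t and project it, x^t = E w_t."""
    w = np.zeros(V)
    w[word_id] = 1.0                      # one-hot vector in R^V
    return E @ w                          # word embedding x^t in R^m

# Eq. (3): feature set I = {v_1, ..., v_K}, v_i in R^D. Random placeholder here;
# in practice these are produced by a pretrained CNN applied to the image.
I_feats = rng.normal(size=(K, D))
```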

3.2 Long Short Term Memory

We briefly introduce the standard Long Short-Term Memory (LSTM), a variant of RNN which is effective and widely used in language generation models. LSTM incorporates a memory cell and non-linear gating units to effectively overcome the vanishing and exploding gradient problems. Our implementation of LSTM is shown in Fig. 2. The formulas for the LSTM forward pass are given below:

$$\begin{aligned} \begin{array}{l} {\mathbf{{i}}^t} = \sigma ({\mathbf{{W}}_{ix}}{\mathbf{{x}}^t} + {\mathbf{{U}}_{ih}}{\mathbf{{h}}^{t - 1}} + {\mathbf{{V}}_{iv}}{{\mathbf{{\tilde{v}}}}^t} + {\mathbf{{b}}_i})\\ {\mathbf{{f}}^t} = \sigma ({\mathbf{{W}}_{fx}}{\mathbf{{x}}^t} + {\mathbf{{U}}_{fh}}{\mathbf{{h}}^{t - 1}} + {\mathbf{{V}}_{fv}}{{\mathbf{{\tilde{v}}}}^t} + {\mathbf{{b}}_f})\\ {\mathbf{{g}}^t} = \phi ({\mathbf{{W}}_{gx}}{\mathbf{{x}}^t} + {\mathbf{{U}}_{gh}}{\mathbf{{h}}^{t - 1}} + {\mathbf{{V}}_{gv}}{{\mathbf{{\tilde{v}}}}^t} + {\mathbf{{b}}_g})\\ {\mathbf{{c}}^t} = {\mathbf{{i}}^t} \odot {\mathbf{{g}}^t} + {\mathbf{{f}}^t} \odot {\mathbf{{c}}^{t - 1}}\\ {\mathbf{{o}}^t} = \sigma ({\mathbf{{W}}_{ox}}{\mathbf{{x}}^t} + {\mathbf{{U}}_{oh}}{\mathbf{{h}}^{t - 1}} + {\mathbf{{V}}_{ov}}{{\mathbf{{\tilde{v}}}}^t} + {\mathbf{{b}}_o})\\ {\mathbf{{h}}^t} = \phi ({\mathbf{{c}}^t}) \odot {\mathbf{{o}}^t} \end{array}. \end{aligned}$$
(4)

where \(\mathbf{{i}}^{t}\), \(\mathbf{{f}}^{t}\), \(\mathbf{{o}}^{t}\), \(\mathbf{{g}}^{t}\), \(\mathbf{{c}}^{t}\) and \(\mathbf{{h}}^{t}\) are the input gate, forget gate, output gate, cell input, cell state and hidden state of the LSTM respectively. \(\sigma \) is the logistic sigmoid activation and \(\phi \) is the hyperbolic tangent activation. \(\mathbf{{W}}_{**}\), \(\mathbf{{U}}_{**}\), \(\mathbf{{V}}_{**}\) and \(\mathbf{{b}}_{*}\) are learned weight matrices and biases. The LSTM states are initialized as follows:

$$\begin{aligned} {\mathbf{{c}}^0} = {f_{mlp}}\Big (\frac{1}{K}\sum \nolimits _{i = 1}^K {{\mathbf{{v}}_i}} \Big ) \quad \text {and} \quad {\mathbf{{h}}^0} = {f_{mlp}}\Big (\frac{1}{K}\sum \nolimits _{i = 1}^K {{\mathbf{{v}}_i}} \Big ). \end{aligned}$$
(5)
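
A minimal NumPy sketch of one forward step of Eq. 4 and the state initialization of Eq. 5 follows; the parameter container and function names are our own. The paper writes the same \(f_{mlp}\) for both states; separate MLPs could equally be used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, v_tilde, W, U, Vv, b):
    """One forward pass of Eq. (4). W, U, Vv, b are dicts keyed by 'i','f','g','o'."""
    pre = {k: W[k] @ x_t + U[k] @ h_prev + Vv[k] @ v_tilde + b[k]
           for k in ('i', 'f', 'g', 'o')}
    i_t, f_t, o_t = sigmoid(pre['i']), sigmoid(pre['f']), sigmoid(pre['o'])
    g_t = np.tanh(pre['g'])
    c_t = i_t * g_t + f_t * c_prev        # new cell state (elementwise products)
    h_t = np.tanh(c_t) * o_t              # new hidden state
    return h_t, c_t

def init_states(I_feats, f_mlp):
    """Eq. (5): c^0 and h^0 from the mean image feature passed through an MLP."""
    v_mean = I_feats.mean(axis=0)
    return f_mlp(v_mean), f_mlp(v_mean)
```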

3.3 Attention Model

The context vector \(\tilde{\mathbf {v}}^{t}\) is a dynamic representation corresponding to parts of the input image. The attention-based model learns a vector of weights \({\mathbf{{\alpha }}_{i}}\;(i=1,2,...,K)\), one per image location, computed from the features \(\mathbf{{v}}_{i}\) and the previous hidden state \({\mathbf{{h}}^{t - 1}}\). Each \({\mathbf{{\alpha }}_{i}}\) is a scalar between 0 and 1 and \(\sum \nolimits _{i = 1}^K {{\mathbf{{\alpha }}_i} = 1}\).

$$\begin{aligned} \mathbf{{z}}_i^t = \sigma (({\mathbf{{W}}_{ah}}{\mathbf{{h}}^{t - 1}}){\mathbf{{1}}^T} + {\mathbf{{V}}_{aI}}\mathbf{{I}} + {\mathbf{{b}}_a}). \end{aligned}$$
(6)
$$\begin{aligned} {\varvec{\alpha }}_i^t = \mathrm {softmax}({\mathbf{{U}}_{az}}\mathbf{{z}}_i^t). \end{aligned}$$
(7)

where \({\mathbf{{1}}}\) is a vector with all elements set to 1. Additionally, the attention-based model predicts a gating scalar \({\mathbf{{\beta }}^t}\) from the previous hidden state \(\mathbf{{h}}^{t - 1}\) at each time step t. This gating variable allows the decoder to decide whether to put more emphasis on the language model or on the visual context at each time step.

$$\begin{aligned} {{\varvec{\beta }}^t} = \sigma ({\mathbf{{W}}_\beta }{{\mathbf {h}}^{t - 1}}). \end{aligned}$$
(8)
$$\begin{aligned} {\tilde{\mathbf {v}}^t} = {{\varvec{\beta }}^t}\sum \nolimits _{i = 1}^K {{\varvec{\alpha }}_i^t} {{\mathbf {v}}_i}. \end{aligned}$$
(9)

In this work, we use an MLP to compute the output word probability conditioned on the image (the context vector), the previously generated word, and the decoder hidden state \({\mathbf{{h}}^t}\). In formula 10, \({\mathbf{{W}}_o} \in {\mathbf{{R}}^{V*m}}\), where n is the LSTM hidden dimensionality, \({\mathbf{{W}}_h} \in {\mathbf{{R}}^{m*n}}\) and \({\mathbf{{V}}_v} \in {\mathbf{{R}}^{m*D}}\).

$$\begin{aligned} p({\mathbf{{w}}^t}) =softmax({\mathbf{{W}}_o}({\mathbf{{W}}_h}{\mathbf{{h}}^t} + {\mathbf{{V}}_v}{\mathbf{{\tilde{v}}}^t} + {\mathbf{{x}}^{t - 1}})). \end{aligned}$$
(10)
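
The attention computation of Eqs. 6–9 and the output layer of formula 10 can be sketched as follows (shapes follow the definitions above; the helper names and the attention dimension a are our own assumptions): I_feats is the K × D feature matrix, W_ah and V_aI project into the attention dimension, U_az has length a, and W_beta has length n.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(h_prev, I_feats, W_ah, V_aI, U_az, b_a, W_beta):
    """Eqs. (6)-(9): attention weights alpha, gate beta and context vector v~."""
    Z = sigmoid(I_feats @ V_aI.T + h_prev @ W_ah.T + b_a)  # z_i for every location, (K, a)
    alpha = softmax(Z @ U_az)            # Eq. (7): one weight per location, sums to 1
    beta = sigmoid(W_beta @ h_prev)      # Eq. (8): scalar gate
    v_tilde = beta * (alpha @ I_feats)   # Eq. (9): gated weighted sum of the v_i
    return alpha, v_tilde

def word_distribution(h_t, v_tilde, x_prev, W_o, W_h, V_v):
    """Formula (10): probability over the vocabulary for the next word."""
    return softmax(W_o @ (W_h @ h_t + V_v @ v_tilde + x_prev))
```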

3.4 Jointly Measuring Relevance and Coherence

We assume that a low dimensional embedding exists for the representations of image and sentence. To measure the relevance between the visual content and semantics, we compute their distance in the embedding space. Thus, we define the relevance loss as:

$$\begin{aligned} {\mathbf{{E}}_{r}}(\mathbf{{v,S}}) = \sum \nolimits _{t = 1}^L {\left\| {{\mathbf{{V}}_r}{{\mathbf{{\tilde{v}}}}^t} - {\mathbf{{x}}^t}} \right\| _2^2}. \end{aligned}$$
(11)

where \({\mathbf{{V}}_r} \in {\mathbf{{R}}^{m*D}}\) projects the context vector into the word embedding space and L is the length of the sentence. Inspired by the recent success of probabilistic sequence models in machine translation, the coherence loss is defined as:

$$\begin{aligned} {\mathbf{{E}}_\mathbf{{c}}}(\mathbf{{v}},\mathbf{{S}}) = - \log \Pr (\mathbf{{S}}|\mathbf{{v}}) = - \sum \nolimits _{t = 1}^L {\log \Pr ({\mathbf{{w}}^t}|\mathbf{{v}},{\mathbf{{w}}^1},{\mathbf{{w}}^2},...,{\mathbf{{w}}^{t - 1}})}. \end{aligned}$$
(12)

By minimizing the coherence loss, the contextual relationship among the words in the sentence can be guaranteed, making the sentence coherent and smooth. In the image caption task, both the relevance loss and the coherence loss are estimated to form the objective function. The training of our model is performed by simultaneously minimizing the relevance loss and the coherence loss. Therefore, the final objective function is given below:

$$\begin{aligned} {\mathbf{{E}}_{l}}(\mathbf{{v}},\mathbf{{S}}) = {\mathbf{{E}}_\mathbf{{c}}}(\mathbf{{v}},\mathbf{{S}}) + {\lambda }{\mathbf{{E}}_{{r}}}(\mathbf{{v}},\mathbf{{S}}). \end{aligned}$$
(13)
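
Under the definitions above, the two losses and the joint objective of Eqs. 11–13 amount to the following sketch; V_r, the per-step context vectors and the word embeddings are assumed to come from the decoder pass.

```python
import numpy as np

def relevance_loss(V_r, v_tildes, xs):
    """Eq. (11): squared distance between projected contexts and word embeddings."""
    return sum(np.sum((V_r @ v_t - x_t) ** 2) for v_t, x_t in zip(v_tildes, xs))

def coherence_loss(word_probs, word_ids):
    """Eq. (12): negative log-likelihood of the ground-truth words."""
    return -sum(np.log(p[w] + 1e-12) for p, w in zip(word_probs, word_ids))

def joint_loss(V_r, v_tildes, xs, word_probs, word_ids, lam=0.8):
    """Eq. (13): E_l = E_c + lambda * E_r (lambda = 0.8 works best in Sect. 4.2)."""
    return coherence_loss(word_probs, word_ids) + lam * relevance_loss(V_r, v_tildes, xs)
```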

4 Experiments

In this section, we first describe our experimental method and then quantitatively analyse the experimental results.

Table 1. Performance on Flickr8k, Flickr30k and MS COCO datasets. BLEU-1,2,3,4/METEOR metrics compared to [21]. Higher is better in all columns. (-) indicates an unknown metric, bold font indicates the highest metric score.

4.1 Experimental Preparation

Our experiments are conducted on the widely used Flickr8k [5], Flickr30k [24] and MS COCO [9] datasets, which contain 8,092, 31,783 and 123,287 images respectively.

For the Flickr8k and Flickr30k datasets, each image is paired with 5 references. For Flickr8k, we use the predefined splits containing 6,000 images for training, 1,000 images for validation and 1,000 images for test. For Flickr30k, we use the publicly available splits containing 29,000 images for training, 1,000 images for validation and 1,000 images for test.

For the COCO dataset, some images have more than 5 references, so we keep only 5 references per image for consistency across the dataset. We use the same data split as in [6, 21], containing 82,782 images for training, 5,000 images for validation and 5,000 images for test. However, a small number of images are not in RGB format and are discarded to simplify feature extraction, so the final numbers of images for training, validation and test are 113,079, 4,989 and 4,982 respectively.

We apply only basic tokenization to MS COCO, matching the tokenization already present in Flickr8k and Flickr30k. We do not convert sentences to lowercase, and we keep all non-alphanumeric characters except double quotes. In our experiments, we use a fixed vocabulary size of 10,000; all other words are marked as ‘UNK’. We append a terminator ‘\(\mathbf {{<}eos{>}}\)’ to the end of each sentence.
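
A possible preprocessing sketch consistent with this description is shown below; whether ‘UNK’ and ‘<eos>’ count toward the 10,000-word budget is our assumption.

```python
from collections import Counter

def build_vocab(tokenized_captions, size=10000):
    """Keep the `size` most frequent words; everything else maps to 'UNK'."""
    counts = Counter(w for caption in tokenized_captions for w in caption)
    word2id = {w: i for i, (w, _) in enumerate(counts.most_common(size))}
    word2id['UNK'] = len(word2id)
    word2id['<eos>'] = len(word2id)
    return word2id

def encode(caption_tokens, word2id):
    """Map tokens to ids, replacing rare words with 'UNK' and appending '<eos>'."""
    ids = [word2id.get(w, word2id['UNK']) for w in caption_tokens]
    return ids + [word2id['<eos>']]
```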

We use the Oxford VGG19 [16], pretrained on ImageNet without finetuning, for image feature extraction; our model is therefore not end-to-end, and image features are pre-extracted. Our model’s hyper-parameters are all the same as in [21], and the hyper-parameters used for training on the three datasets are identical. We train only the parameters of the LSTM and the decoding MLP. All learned parameters are initialized randomly and the model uses the ADAM [7] optimizer with an initial learning rate of \(2 \times {10^{- 4}}\).
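
A minimal PyTorch sketch of this setup (frozen VGG19 features, ADAM at the stated learning rate) follows; the `decoder` module here is only a placeholder for the attention LSTM and output MLP of Sect. 3, not the actual implementation.

```python
import torch
import torchvision

# Frozen VGG19 feature extractor: pretrained on ImageNet, no finetuning.
vgg = torchvision.models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

# Placeholder for the trainable decoder (attention LSTM + output MLP) of Sect. 3.
decoder = torch.nn.LSTMCell(input_size=512, hidden_size=512)

# ADAM with the initial learning rate reported above.
optimizer = torch.optim.Adam(decoder.parameters(), lr=2e-4)
```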

4.2 Experiments Analysis

Our model is based on [21], which we compare against. In this way, we can verify that considering the relationship between semantic and visual contents significantly improves the performance of the attention-based caption model.

We report results using the COCO captioning evaluation tool, which reports the following metrics: BLEU [13] and METEOR [1]. Table 1 shows the results measured on the Flickr8k, Flickr30k and MS COCO datasets. Compared with the attention-based model [21], our model improves the performance of the original model significantly. When considering the relevance of visual and semantic contents, our model improves all metric scores on Flickr8k, Flickr30k and MS COCO. This means that the semantics (e.g. subjects, verbs or objects) in the sentences generated by our model are more precise. For the Flickr8k dataset, our model improves the BLEU-4 score from 21.3 to 22.7 and the METEOR score from 20.3 to 21.5, an average gain of 1.36 points over all metrics. On Flickr30k, the BLEU-4 score improves from 19.9 to 23.2 and the METEOR score from 18.5 to 20.4. For COCO, the BLEU-4 score increases from 25.0 to 30.4, the METEOR score from 23.0 to 25.5, and the average score increases by 4.36 points over all metrics. We also observe that the improvement is most pronounced on the COCO dataset, followed by Flickr30k. This indicates that the richer the data and the wider the data distribution, the better our model works.

Fig. 3. Sentences 1–3 are ground-truth references. SAT: [21].

We also reproduce the method of [21], which we call SAT here, and generate descriptions for test-set images of the MS COCO dataset (beam search size fixed at 3), then compare the results with ours. Figure 3 shows some examples. The sentences generated by our model are more precise, reasonable and robust than those of SAT. In the first picture of Fig. 3, the subject ‘people’ is correct, but ‘women’ is more precise, and ‘down a street’ is more reasonable than ‘on a beach’. In the third picture, the verb ‘grazing’ is more suitable than ‘walking across’. For the fourth picture, SAT generates a sentence with a logical error, but ours does not.

Through analysis, we also find that both SAT and our model often produce the same sentence for similar pictures; both show this tendency, but our model is relatively more flexible. We list some examples in Fig. 4. Over the 4,982 test samples, the sentence ‘A man riding a snowboard down a snow covered slope’ is generated 36 times by SAT versus 10 times by our model, and ‘A baseball player swinging a bat at a ball’ 37 times by SAT versus 11 times by ours. From Fig. 4, we can see that the sentences generated by our model, which considers relevance, are more diverse and precise than those of SAT.

Fig. 4. Examples corresponding to the same sentence generated by SAT: [21].

In formula 13, we use the tradeoff parameter \({\lambda }\) to weight the relevance term. With the beam size fixed at 3, we illustrate the performance curves for different tradeoff parameter values in Fig. 5(a). To make all performance curves fall into a comparable scale, all BLEU-1,2,3,4 and METEOR scores are normalized as:

$$\begin{aligned} {s'_\lambda } = \frac{{{s_\lambda } - \mathop {\min }\limits _\lambda \{ {s_\lambda }\} }}{{\mathop {\max }\limits _\lambda \{ {s_\lambda }\} - \mathop {\min }\limits _\lambda \{ {s_\lambda }\} }}. \end{aligned}$$
(14)

where \({s'_\lambda }\) and \({s_\lambda }\) denote the normalized and original scores for a given \({\lambda }\), respectively. When \({\lambda }\) exceeds 1.5, the scores decrease rapidly, and at \({\lambda }=10\) the normalized values of all metrics drop to 0.0. For readability, we therefore do not show values of \({\lambda }\) greater than 3 in the chart. From Fig. 5(a) we can see that the best performance is achieved when \({\lambda }\) is about 0.8: at \({\lambda }=0.8\), all metric scores are relatively concentrated and higher than for other values. This indicates that it is reasonable to jointly learn the visual-semantic embedding space within the deep RNN.
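
Eq. 14 is the usual min-max normalization; for reference, a short NumPy version:

```python
import numpy as np

def normalize(scores):
    """Eq. (14): min-max normalize one metric's scores over the tested lambda values."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())
```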

Fig. 5. (a): The performance for different \({\lambda }\). (b)–(d): The performance for different beam search sizes.

We then explore how changing the beam search size influences all score metrics on the three benchmark datasets. In this experiment, we fix the tradeoff parameter \({\lambda }\) at 0.8. We illustrate the performance curves for different beam search sizes in Fig. 5(b)–(d). To make all performance curves fall into a comparable scale, all BLEU-1,2,3,4 and METEOR scores are normalized in the same way as Eq. 14.

From Fig. 5(b)–(d) we can see that different datasets and models have different optimal beam search sizes. For the Flickr8k dataset, our model obtains the best performance with a beam search size of 2, whereas 4 is best for Flickr30k and 3 is best for COCO. However, our experiments are compared against [21], which fixed the beam search size at 3 for all datasets and models. For a fair comparison, we also fix the beam search size at 3 for all datasets and models.
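
For readers unfamiliar with beam search decoding, the following generic sketch illustrates the procedure; the `step_fn` interface and all names are our own, not the implementation used in the experiments.

```python
import numpy as np

def beam_search(step_fn, init_state, bos_id, eos_id, beam_size=3, max_len=20):
    """Generic beam search. `step_fn(word_id, state)` must return
    (log-probabilities over the vocabulary, next state)."""
    beams = [(0.0, [bos_id], init_state)]          # (log-prob, word ids, state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, words, state in beams:
            log_probs, new_state = step_fn(words[-1], state)
            for w in np.argsort(log_probs)[-beam_size:]:
                candidates.append((logp + log_probs[w], words + [int(w)], new_state))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates:
            (finished if cand[1][-1] == eos_id else beams).append(cand)
            if len(beams) == beam_size:
                break
        if not beams:                              # every surviving beam has ended
            break
    return max(finished + beams, key=lambda c: c[0])[1]
```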

5 Conclusion

In this paper, we present a new method that jointly learns a semantic-visual embedding and an attention-based LSTM. In particular, the semantic embedding space is incorporated into attention-based LSTM learning, and we explore the local and global relationships between the semantic and visual contents. Experimental results show that this method significantly improves the performance of the attention-based image caption model on three benchmark datasets measured by the BLEU and METEOR metrics. We also explore the influence of different hyper-parameters on the model. In future work, we will continue to explore the relationship between the visual and semantic contents from different perspectives and with different methods.