1 Introduction

Describing the visual scene, or some of its details, after a glance at an image is simple for people but quite difficult for visual recognition models. The task has great impact in areas such as helping blind people perceive their environment, so it has become a focus of the computer vision community.

The task involves several subproblems such as object detection and behavior recognition. Therefore, most previous works build on these tasks and rely on hardcoded visual concepts and sentence templates [5, 12]. However, sentence templates limit the variety of the generated sentences. With the success of recurrent neural networks on sequence problems such as machine translation [25], neural encoder-decoder models have been explored [23]. These models treat the task as a translation problem that translates image information into text. The encoder is a convolutional neural network, which has achieved great success in computer vision problems such as image classification and object detection, and the decoder is a recurrent neural network, which shows great potential in handling sequences. Without sentence templates, the generated sentences are more flexible and closer to natural language.

Recently, neural encoder-decoder models have made great progress with attention schemes, which usually align each word of the sentence with some regions of the image. However, these attention schemes need region features and change the structure of the initial model. Our attention scheme, in contrast, is based on the global feature of the image, which is already the input of the initial model, so we do not need an extra procedure to extract region features. In summary, our main contribution is a new attention scheme that requires no region features and can be appended directly to the initial model.

2 Related Work

2.1 Convolutional Neural Networks for Feature Extraction

Recently, convolutional neural networks have achieved great success in computer vision tasks such as image classification [8, 19, 20, 22], object detection [6, 7], and image annotation. This success is widely attributed to their powerful ability to extract features: features extracted by CNNs outperform most outstanding handcrafted features such as HOG [3] and SIFT [16]. Thanks to the remarkable performance of CNNs on object detection and image annotation, which are subproblems of image captioning, we can represent nearly all of the image information in a simple way. This comprehensive representation of objects and attributes builds an essential foundation for the image captioning task.

2.2 Image Caption with Templates

Building on previous work in image processing and sentence processing, a number of approaches treat the task as a retrieval problem: they break up the samples in the training set and stitch the prediction together from the training annotations [13, 14, 15]. Other approaches generate captions with fixed templates whose slots are filled with the results of object detection, attribute classification, and scene recognition. Both kinds of approaches limit the variety of possible outputs.

2.3 Image Caption with Neural Networks

Inspired by the success of sequence-to-sequence encoder-decoder frameworks in machine translation [1, 2, 19, 21], the image caption task has been regarded as translating from image to text. [17] first developed a multimodal recurrent neural network to translate the image into text. [11] proposed a feed-forward neural network with a multimodal log-bilinear model to predict the next word given the previous output and the image. [9] simplified the model and replaced the feed-forward network with an LSTM as the decoder, and also showed that the image only needs to be fed into the LSTM at the beginning. [21] developed a new method to align text with images, in particular words with regions, by combining the output of a bidirectional RNN [18] with the object detection results from R-CNN [13]; the resulting alignment in the joint embedding helps caption generation. [4] replaced the vanilla recurrent neural network with a multilayer LSTM decoder: the first and second LSTMs produce a joint representation of the visual and language inputs, and the third and fourth LSTMs transform the outputs of the previous LSTMs to produce the final output.

2.4 Attention Scheme for Image Caption

Recently, attention schemes have been integrated into encoder-decoder frameworks. [26] learns a latent alignment from scratch by incorporating an attention mechanism when generating each word. [24] introduces a method to incorporate high-level semantic concepts into the encoder-decoder approach. [27] explores several architectures for augmenting the image representation with high-level attributes for description generation. Most of the above attention schemes use region features, which are usually mid-level features. Although [10] uses high-level semantic concepts, the high-level features are extracted by manual selection, which is impractical for large-scale data.

3 Our Model

In this section we introduce our end-to-end neural network model. Our model is based on the encoder-decoder network used in natural language translation, so it is a probabilistic framework that directly maximizes the probability of the correct sentence given the image. The difference from the translation model is that we use a CNN as the encoder for the image instead of a recurrent neural network over a sequence input; both models use a recurrent neural network as the decoder to generate the target sequence. The problem can therefore be formulated as

$$\begin{aligned} \theta ^*=\mathop {\mathrm{arg max}}\limits _{\theta }\sum _{(I,S)}\log p(S|I;\theta ) \end{aligned}$$
(1)

where \(\theta \) denotes the parameters of our model, I is the image, and S is the correct description. Because S is a sequence whose length varies between sentences, we apply the chain rule to model the joint probability over \(S_0,\dots ,S_N\), where N is the length of this particular sentence, as

$$\begin{aligned} \log p(S|I;\theta )=\sum _{t=0}^{N}\log p(s_t|I,s_0,\dots ,s_{t-1};\theta ) \end{aligned}$$
(2)

Given the image information and the information extracted from the syntactic structure, our attention scheme lets the model focus on a specific part of the image information and decide what to output. The complete version of our model can therefore be written as

$$\begin{aligned} \log p(S|I;\theta )=\sum _{t=0}^{N}\log p(s_t|I_t,s_0,\dots ,s_{t-1};\theta ) \end{aligned}$$
(3)
$$\begin{aligned} I_t=f_a(I,s_0,\dots ,s_{t-1})\bigodot I,\quad t\in [0,N] \end{aligned}$$
(4)
$$\begin{aligned} \theta ^*,\theta ^*_{a}=\mathop {\mathrm{arg max}}\limits _{\theta ,\theta _a}\sum _{(I,S)}\log p(S|I;\theta ,\theta _a) \end{aligned}$$
(5)

where \(\theta \) denotes the parameters of the basic model, \(\theta _a\) the parameters of the attention scheme, and \(f_a(\cdot )\) the attention scheme itself.

3.1 Encode Images

Following prior works [10, 12], we regard the generated description as a combination of objects and their attributes, which can be represented as high-level features. We follow NIC [23] and encode the image with a CNN, using VGGNet as the image feature extractor. VGGNet is pretrained on ImageNet to classify 1000 classes, so we believe it has a powerful ability to extract features that capture nearly all the information we need, such as object classes and attributes. Concretely, we use the 16-layer version of VGGNet and, following NIC, take only the output of its second fully connected layer, a 4096-dimensional vector. Since our model needs to process both image and text information, we convert all the information into a multimodal space. The encoder can be represented as

$$\begin{aligned} a=f(CNN_{\theta _c}(I)) \end{aligned}$$
(6)
$$\begin{aligned} v=W_Ia+b_I \end{aligned}$$
(7)

where I represents the image, \(CNN_{\theta _c}(I)\) translates the image pixels into a 4096-dimensional feature vector, \(W_I\in R^{m\times d}\) is a learned weight matrix that maps the image information into the multimodal space, and \(b_I\) is the bias. We set the activation function \(f(\cdot )\) to the rectified linear unit (ReLU), which computes \(f:x\mapsto \max (x,0)\).
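
For concreteness, a minimal NumPy sketch of Eqs. (6)-(7); the multimodal dimension (256) and the variable names are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def relu(x):
    # f(x) = max(x, 0), applied element-wise
    return np.maximum(x, 0.0)

def encode_image(fc2_feature, W_I, b_I):
    """Eqs. (6)-(7): map a 4096-d VGG fc2 feature into the multimodal space.

    fc2_feature : (4096,) output of the second fully connected layer of VGG-16
    W_I         : (d, 4096) learned projection, d = multimodal dimension
    b_I         : (d,) learned bias
    """
    a = relu(fc2_feature)        # Eq. (6): a = f(CNN(I))
    v = W_I @ a + b_I            # Eq. (7): v = W_I a + b_I
    return a, v

# Example with assumed dimensions (d = 256 is an illustrative choice).
rng = np.random.default_rng(0)
fc2 = rng.normal(size=4096)
W_I, b_I = rng.normal(scale=0.01, size=(256, 4096)), np.zeros(256)
a, v = encode_image(fc2, W_I, b_I)
```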

Fig. 1. The image encoding procedure.

Specifically, there are two reasons why we use the second fully connected layer of the CNN. On the one hand, the features extracted from the convolutional layers are distributed and become sparser as the layers go deeper, while the bottom layers extract low-level features we do not need. On the other hand, among the three fully connected layers we prefer the top ones because they contain higher-level features; but the last fully connected layer has 1000 dimensions corresponding to the 1000 classes of the ImageNet Large Scale Visual Recognition Challenge, and not all of these classes appear in our images. Therefore, we choose the features extracted from the second fully connected layer of VGGNet to represent the image, and before encoding them we apply a ReLU function to remove negative outputs (Fig. 1).

3.2 Decoding Sentences

Since our target is a sentence, i.e. a sequence of words, our decoder needs to handle sequences. We therefore choose a recurrent neural network, which has achieved great success on sequence problems, especially machine translation [25], and which forms the basis of our model. Our recurrent neural network has two inputs, the image and the previous output, and both need to be encoded into the multimodal space; the process for the image was described in Sect. 3.1. To encode the previous output, which is a word, we traverse all the sentences in the training data to build a vocabulary. Following [10], we discard words whose frequency is lower than a threshold, which is 5 in our model, and for convenience we add start and end symbols to each sentence. Every word then gets a unique one-hot code. Because the one-hot encoding is sparse and high dimensional, we convert it into the dense multimodal space as follows

$$\begin{aligned} x_t=W_ss_{t-1}+b_s \end{aligned}$$
(8)

where \(s_{t-1}\) is the one-hot encoded output of the RNN at time step \(t-1\), \(x_t\) is its encoding in the multimodal space, \(W_s\) is a learned weight matrix that maps the word into the multimodal space, and \(b_s\) is the bias.
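
The vocabulary construction and the word encoding of Eq. (8) could look roughly like the sketch below; the frequency threshold of 5 and the start/end symbols come from the text, while the token names, the unknown-word token, and the matrix shapes are assumptions.

```python
from collections import Counter
import numpy as np

START, END, UNK = "<start>", "<end>", "<unk>"   # assumed symbol names

def build_vocab(sentences, min_freq=5):
    """Keep words whose frequency is at least min_freq, as described above."""
    counts = Counter(w for s in sentences for w in s.split())
    words = sorted(w for w, c in counts.items() if c >= min_freq)
    return {w: i for i, w in enumerate([START, END, UNK] + words)}

def embed_word(word, vocab, W_s, b_s):
    """Eq. (8): x_t = W_s s_{t-1} + b_s, with s_{t-1} one-hot encoded."""
    one_hot = np.zeros(len(vocab))
    one_hot[vocab.get(word, vocab[UNK])] = 1.0
    return W_s @ one_hot + b_s       # dense vector in the multimodal space
```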

With the multimodal data as inputs and the sequence of words as labels, we train our model by minimizing the cross-entropy loss. The full recurrent neural network can be represented by the following formulations

$$\begin{aligned} h_t=f(v+x_t+W_hh_{t-1}) \end{aligned}$$
(9)
$$\begin{aligned} s_t=softmax(W_oh_t+b_o) \end{aligned}$$
(10)

where \(x_t\) is the input to the recurrent neural network at time step t, v is the image encoding in the multimodal space, \(h_t\) and \(h_{t-1}\) are the hidden states at time steps t and \(t-1\), \(s_t\) is the output of the RNN at time step t, and \(softmax(\cdot )\) is the softmax function. We train our model with Stochastic Gradient Descent (Fig. 2).
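
A single decoding step of Eqs. (9)-(10), written as a NumPy sketch; the ReLU choice for \(f(\cdot )\) follows the text, while shapes and names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def decode_step(v, x_t, h_prev, W_h, W_o, b_o):
    """One RNN step.

    Eq. (9):  h_t = f(v + x_t + W_h h_{t-1}), with f = ReLU
    Eq. (10): s_t = softmax(W_o h_t + b_o)
    """
    h_t = np.maximum(v + x_t + W_h @ h_prev, 0.0)
    s_t = softmax(W_o @ h_t + b_o)      # distribution over the vocabulary
    return h_t, s_t
```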

Fig. 2. The word decoding procedure.

3.3 Global Feature Based Attention Scheme

An attention scheme is key to a good model when generating a description. The image is static, but the description is dynamic (i.e. it is a list of words generated over time), so when generating different words we need to focus on different parts of the image, sometimes the center, sometimes just the background. In other words, the input should be dynamic, not static: when generating a word, we must decide which part of the image information to focus on based on the content we have already generated. This is the core insight of an attention scheme. The model most similar to our global feature based attention scheme (GFA) is [26], but we do not use features extracted from regions, only the feature of the full image, and we use a different activation function.

In our model, we use a simple feed-forward network with only one hidden layer as the attention scheme, which decides the weights of different parts of the image information. The scheme can be represented by the following formulation

$$\begin{aligned} h_t^a=\sigma (W_gh_{t-1}+W_ys_{t-1}+a) \end{aligned}$$
(11)
$$\begin{aligned} o_t^a=\sigma (W_mh_t^a) \end{aligned}$$
(12)
$$\begin{aligned} a_t=o_t^a\bigodot a \end{aligned}$$
(13)

where a is the image feature from Eq. (6), \(s_{t-1}\) is the previous output, \(h_t^a\) is the hidden state of the feed-forward network at time step t, \(\sigma (\cdot )\) is the sigmoid activation function, \(o_t^a\) is the attention weight vector, \(\bigodot \) denotes the element-wise product, and \(W_g,W_y,W_m\) are weights to be learned (Fig. 3).

Fig. 3. Our global feature based attention scheme; \(\bigodot \) denotes the element-wise product.

We use the sigmoid function because it normalizes the weights into the range (0, 1) and adds nonlinearity as an activation function; more importantly, the global feature based attention scheme does not need competition between different parts of the image information, whereas a function like \(softmax(\cdot )\) forces the parts to compete with each other. The input of our model can therefore be written as

$$\begin{aligned} h_t=f(v_t+x_t+W_hh_{t-1}) \end{aligned}$$
(14)
$$\begin{aligned} v_t=W_aa_t+b_a \end{aligned}$$
(15)

where \(v_t\) is the result of the global feature based attention scheme mapped into the multimodal space.
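
A sketch of one step of the global feature based attention, following Eqs. (11)-(15); the sigmoid gate over the 4096-d global feature follows the text, and all shapes and names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gfa_step(a, h_prev, s_prev, W_g, W_y, W_m, W_a, b_a):
    """Global feature based attention for one time step.

    Eq. (11): h_t^a = sigmoid(W_g h_{t-1} + W_y s_{t-1} + a)
    Eq. (12): o_t^a = sigmoid(W_m h_t^a)
    Eq. (13): a_t   = o_t^a * a            (element-wise product)
    Eq. (15): v_t   = W_a a_t + b_a        (into the multimodal space)
    """
    h_a = sigmoid(W_g @ h_prev + W_y @ s_prev + a)
    o_a = sigmoid(W_m @ h_a)
    a_t = o_a * a                  # re-weighted global image feature
    v_t = W_a @ a_t + b_a
    return v_t
```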

3.4 Our Full Model

Because our attention scheme is based only on the image feature, we do not need any other information: we can append it directly to the initial model and train it together with the original model as an end-to-end system.

At each time step, the image feature is first fed into the attention scheme, and its output replaces the original image feature, so we do not need to change the structure of the initial model.
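
Putting the pieces together, one way to append GFA to the initial model is the greedy decoding loop below; it reuses the `embed_word`-style encoding, `gfa_step`, and `decode_step` sketches above, and the maximum length, token names, and parameter dictionary are assumptions.

```python
import numpy as np

def generate_caption(a, params, vocab, inv_vocab, max_len=20):
    """Greedy decoding with GFA replacing the static image input.

    a      : ReLU'd 4096-d global image feature (Eq. 6)
    params : dict holding the weight matrices used in the sketches above
    """
    p = params
    h = np.zeros(p["W_h"].shape[0])
    word, caption = "<start>", []
    for _ in range(max_len):
        s_prev = np.zeros(len(vocab)); s_prev[vocab[word]] = 1.0
        x_t = p["W_s"] @ s_prev + p["b_s"]                          # Eq. (8)
        v_t = gfa_step(a, h, s_prev, p["W_g"], p["W_y"],
                       p["W_m"], p["W_a"], p["b_a"])                # Eqs. (11)-(15)
        h, s_t = decode_step(v_t, x_t, h, p["W_h"], p["W_o"], p["b_o"])  # Eqs. (14), (10)
        word = inv_vocab[int(np.argmax(s_t))]
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption)
```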

Fig. 4. Our neural image caption model with global feature based attention scheme.

4 Results

4.1 Implementation Details

Encoder-CNN. We use the 16-layer version of VGGNet as the encoder CNN; specifically, the output of its second fully connected layer, a 4096-dimensional vector, is used as the image representation, and we add a ReLU function to remove negative outputs. During training we fix the convolutional layers and fine-tune the fully connected layers on the ILSVRC16 dataset for 10 epochs (Fig. 4).
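
A possible PyTorch sketch of this setup (not the authors' code): take the VGG-16 layers up to the ReLU after the second fully connected layer and freeze the convolutional part.

```python
import torch
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(pretrained=True)           # 16-layer VGGNet, ImageNet weights

# Freeze all convolutional layers; only the fully connected layers are fine-tuned.
for p in vgg.features.parameters():
    p.requires_grad = False

# Keep the classifier up to (and including) the ReLU after the second
# fully connected layer, so the output is a non-negative 4096-d vector.
fc2_extractor = nn.Sequential(
    vgg.features,
    vgg.avgpool,
    nn.Flatten(),
    *list(vgg.classifier.children())[:5],     # fc1 -> ReLU -> Dropout -> fc2 -> ReLU
)

with torch.no_grad():
    feat = fc2_extractor(torch.randn(1, 3, 224, 224))   # shape (1, 4096)
```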

Decoder-RNN. We concatenate the word embedding vector and the image feature to obtain the input in the multimodal space. Our decoder RNN has a single hidden layer of size 256. To make the model nonlinear we add a ReLU function after the hidden layer, which also helps avoid the vanishing gradient problem, and we use dropout with a ratio of 0.4 to avoid over-fitting.
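
A hedged PyTorch sketch of a decoder cell with the stated hyper-parameters (hidden size 256, ReLU, dropout 0.4); the fusion of word and image inputs below follows the additive form of Eq. (14) rather than concatenation, and the layer names are assumptions.

```python
import torch
import torch.nn as nn

class DecoderRNNCell(nn.Module):
    """Single-hidden-layer decoder cell: hidden size 256, ReLU, dropout 0.4."""

    def __init__(self, vocab_size, hidden_size=256, dropout=0.4):
        super().__init__()
        self.embed = nn.Linear(vocab_size, hidden_size)     # Eq. (8)
        self.w_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.act = nn.ReLU()
        self.drop = nn.Dropout(p=dropout)
        self.w_o = nn.Linear(hidden_size, vocab_size)       # Eq. (10), pre-softmax

    def forward(self, v_t, word_onehot, h_prev):
        x_t = self.embed(word_onehot)
        h_t = self.act(v_t + x_t + self.w_h(h_prev))        # Eq. (14)
        logits = self.w_o(self.drop(h_t))    # softmax is applied inside the loss
        return h_t, logits
```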

Training. We set the learning rate to 0.001 and decay it by a factor of 0.1 every 50 epochs. Because every image in the dataset has several descriptions, we pair each image with only one description per sample. We train the model with Stochastic Gradient Descent and a batch size of 128.
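
With these hyper-parameters, the optimizer and schedule could be set up as in the following sketch; `model`, `loader`, and `num_epochs` are placeholders (`loader` is assumed to yield batches of size 128 and `model` to map image/word inputs to per-step logits).

```python
import torch

def train(model, loader, num_epochs):
    """Training loop with the hyper-parameters stated above (a sketch)."""
    optimizer = torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad), lr=0.001)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(num_epochs):
        for images, captions in loader:
            optimizer.zero_grad()
            logits = model(images, captions[:, :-1])            # predict next words
            loss = criterion(logits.flatten(0, 1), captions[:, 1:].flatten())
            loss.backward()
            optimizer.step()
        scheduler.step()      # multiplies the learning rate by 0.1 every 50 epochs
```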

Evaluation. Because the target output of our image caption model has the same form as in machine translation, we use the same evaluation criteria and report BLEU and METEOR scores.
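
BLEU can be computed over the generated captions with, for example, NLTK; the whitespace tokenization and the use of `corpus_bleu` below are our choices for the sketch, not necessarily the authors'.

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu_scores(references, hypotheses):
    """references: list of lists of reference captions (strings) per image;
    hypotheses: list of generated captions (strings)."""
    refs = [[r.split() for r in ref_group] for ref_group in references]
    hyps = [h.split() for h in hypotheses]
    return {
        f"BLEU-{n}": corpus_bleu(refs, hyps, weights=tuple([1.0 / n] * n))
        for n in (1, 2, 3, 4)
    }
```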

4.2 Quantitative Analysis

We test our model on the Flickr8k and Flickr30k datasets and compare it with some classic models. Our base model uses a vanilla RNN, which is known to perform worse than the LSTM used in NIC, but our GFA yields improvements comparable to those of classic attention models. GFA improves the BLEU-1/2/3/4 and METEOR scores by 5.0/4.8/4.8/3.9/0.017 on Flickr8k and 1.6/0.9/0.8/0.5/0.012 on Flickr30k; by comparison, hard attention improves the scores by 4/4.7/4.4/-/0.015 on Flickr8k and 0.6/1.6/1.9/1.6/0.003 on Flickr30k, and soft attention by 4/3.8/2.9/-/0.001 on Flickr8k and 0.4/1.1/1.1/0.8/0.002 on Flickr30k.

Since our GFA is based on the global feature and we know the weights and biases of the fully connected layers of the CNN, we can derive the attention weights over different parts of the last convolutional feature map. In this derivation we use the least-squares method to solve the over-determined system of equations and ignore the influence of the ReLU function attached to the second fully connected layer. Finally, we use polynomial interpolation to recover the weights of the different parts of the original image. Because we ignore the nonlinearity during the derivation, the result is only an approximate inference of the alignment between words and the image. Even so, it shows that our global feature based attention scheme can align words with the relevant regions (Tables 1 and 2) (Fig. 5).
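
The back-projection step can be sketched as follows: ignoring the ReLUs, the two fully connected layers form one affine map, so given the attended feature \(a_t\) we can estimate the conv5 activations with least squares and then interpolate the resulting 7x7 spatial weights back to image size. The shapes, the use of `np.linalg.lstsq`, and the normalization are assumptions about one way to realize this, not the authors' exact procedure.

```python
import numpy as np

def infer_region_weights(a_t, W_fc1, b_fc1, W_fc2, b_fc2):
    """Approximately invert fc2(fc1(x)) = a_t for the conv feature map x.

    a_t   : (4096,) attended feature at one time step (Eq. 13)
    W_fc1 : (4096, 25088), b_fc1 : (4096,)   first fully connected layer
    W_fc2 : (4096, 4096),  b_fc2 : (4096,)   second fully connected layer
    Ignoring the ReLUs, a_t ~ W_fc2 (W_fc1 x + b_fc1) + b_fc2; solve for x
    in the least-squares sense.
    """
    A = W_fc2 @ W_fc1                          # combined affine map, (4096, 25088)
    b = W_fc2 @ b_fc1 + b_fc2
    x, *_ = np.linalg.lstsq(A, a_t - b, rcond=None)
    fmap = np.abs(x).reshape(512, 7, 7).sum(axis=0)   # 7x7 spatial weight map
    return fmap / (fmap.max() + 1e-8)   # normalize; upsample to image size afterwards
```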

Table 1. Results compared with other models.
Table 2. The improvement of different attention models.
Fig. 5. The alignment result between regions and words.

5 Conclusion

We propose a neural image caption model with a new attention scheme. Our global feature based attention scheme relies on the feature of the full image rather than region features and uses a simpler structure, while achieving improvements comparable to other attention schemes.