
Neurocomputing

Volume 328, 7 February 2019, Pages 48-55

VD-SAN: Visual-Densely Semantic Attention Network for Image Caption Generation

https://doi.org/10.1016/j.neucom.2018.02.106

Abstract

Recently, attributes have demonstrated their effectiveness in guiding image captioning systems. However, most attribute-based image captioning methods treat attribute prediction as a separate task and rely on a standalone stage to obtain the attributes for a given image, e.g., a separately pre-trained Fully Convolutional Network (FCN) is usually adopted. Inherently, they ignore the correlation between the attribute prediction task and the image representation extraction task, and at the same time increase the complexity of the image captioning system. In this paper, we aim to couple the attribute prediction stage and the image representation extraction stage tightly, and propose a novel and efficient image captioning framework called the Visual-Densely Semantic Attention Network (VD-SAN). In particular, the whole captioning system consists of shared convolutional layers from a Dense Convolutional Network (DenseNet), which are further split into a semantic attribute prediction branch and an image feature extraction branch, two semantic attention models, and a long short-term memory (LSTM) network for caption generation. To evaluate the proposed architecture, we construct the Flickr30K-ATT and MS-COCO-ATT datasets based on the original popular image caption datasets Flickr30K and MS COCO, respectively; each image in Flickr30K-ATT or MS-COCO-ATT is annotated with an attribute list in addition to its captions. Empirical results demonstrate that our captioning system achieves significant improvements over state-of-the-art approaches.

Introduction

Image captioning, which attempts to automatically generate captions (usually in English) that describe images or videos, has become a hot topic due to recent advances in both the computer vision and machine translation communities, especially in deep learning.

Recent work in this area adopts the well-known encoder-decoder framework. The encoder-decoder framework was originally introduced for the sequence-to-sequence problem in the machine translation community, where a recurrent neural network (RNN) is used to encode the source sentence and another RNN is employed to predict the target sentence. Such a framework has found applications in many areas, such as speech recognition [1] and text recognition [2]. The sequence-to-sequence problem is analogous in some respects to the image captioning problem. However, instead of translating a source sentence, image captioning aims at “translating” the visual input into a sentence. Therefore, a convolutional neural network (CNN) is adopted to encode the image.
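
To make the framework concrete, the sketch below shows a bare-bones CNN encoder and LSTM decoder in PyTorch. It only illustrates the general encoder-decoder idea, not the network used in this paper; the tiny convolutional stack, the layer sizes, and the way the image feature initializes the LSTM state are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy CNN encoder: image -> single feature vector (illustrative only)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, images):                # images: (B, 3, H, W)
        return self.fc(self.cnn(images).flatten(1))   # (B, feat_dim)

class TinyDecoder(nn.Module):
    """Toy LSTM decoder: the image feature initializes the LSTM state."""
    def __init__(self, vocab_size, feat_dim=256, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):       # captions: (B, T) word ids
        state = (self.init_h(feats).unsqueeze(0), self.init_c(feats).unsqueeze(0))
        hidden, _ = self.lstm(self.embed(captions), state)
        return self.out(hidden)               # (B, T, vocab) word logits

# Toy usage: predict word logits for a random "image" and caption prefix.
feats = TinyEncoder()(torch.randn(2, 3, 224, 224))
logits = TinyDecoder(vocab_size=10000)(feats, torch.randint(0, 10000, (2, 12)))
```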

Most work uses the fully connected layer output of a CNN as the encoded image representation, which is then fed into the top RNN language model to inform it of the image content, as in [4], [11]. On the other hand, Xu et al. [3] propose to use the convolutional-layer output of the CNN and apply a spatial attention mechanism over the convolutional features. Further, Jia et al. [5] propose to feed semantic information extracted from the image into the model so that the generated captions are tightly coupled to the image content. He et al. [6] propose to exploit Part-of-Speech (PoS) tags as a guidance signal for caption generation. More recently, many researchers have found that explicit visual attributes are also very effective in guiding the caption generation process.
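
The following is a minimal sketch of the soft spatial attention idea in the spirit of [3]: an additive scoring function compares each spatial location of the convolutional feature map with the current LSTM hidden state and produces a weighted context vector. The layer names and dimensions are assumptions for illustration; the attention models used in VD-SAN are defined later in the paper and may differ.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive attention over L spatial locations of a conv feature map."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats:  (B, L, feat_dim)  flattened conv map, L = H*W locations
        # hidden: (B, hidden_dim)   current LSTM hidden state
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e, dim=1)           # attention weights over locations
        context = (alpha * feats).sum(dim=1)      # (B, feat_dim) attended context
        return context, alpha.squeeze(-1)

# Toy usage: 49 locations (a 7x7 map) with 512-d features and a 512-d hidden state.
context, alpha = SoftAttention(512, 512)(torch.randn(2, 49, 512), torch.randn(2, 512))
```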

In this paper, we mainly focus on using explicit attributes to guide the image captioning model. The work that bears the most similarity to ours is [7]. In their work, a set of visual attributes is extracted for a given image and then fed into the RNN after applying an attention model over the predicted semantic attributes, which yields a large improvement. However, their method relies on an extra offline attribute extraction stage, e.g., a separately trained fully convolutional network (FCN). Moreover, a standalone CNN is adopted to extract the image representations, which further increases the complexity of their captioning system. In fact, the attribute prediction task and the captioning task are highly related and should not be treated as separate tasks. By sharing features between the attribute prediction sub-network and the image representation extraction sub-network, we can jointly optimize the two tasks with a multi-label classification loss and a cross-entropy loss applied to the long short-term memory (LSTM) network, respectively.
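
The sketch below illustrates this joint objective as just described: a shared backbone feeds an attribute head trained with a multi-label (binary cross-entropy) loss and a captioning head trained with a cross-entropy loss, and both losses are backpropagated through the shared layers. The module names (backbone, attr_head, captioner) and the equal loss weighting are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_loss(backbone, attr_head, captioner, images, attr_targets, captions):
    """Sum of a multi-label attribute loss and a caption cross-entropy loss,
    both computed from features of the same shared backbone."""
    shared = backbone(images)                         # shared image features
    attr_logits = attr_head(shared)                   # (B, num_attrs)
    cap_logits = captioner(shared, captions[:, :-1])  # (B, T, vocab), teacher forcing
    attr_loss = F.binary_cross_entropy_with_logits(attr_logits, attr_targets)
    cap_loss = F.cross_entropy(cap_logits.reshape(-1, cap_logits.size(-1)),
                               captions[:, 1:].reshape(-1))
    return attr_loss + cap_loss                       # equal weights, for illustration

# Toy usage with stand-in modules (any CNN backbone / LSTM captioner could be plugged in).
B, A, V, T = 2, 100, 1000, 12
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
attr_head = nn.Linear(64, A)

class ToyCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed, self.out = nn.Embedding(V, 64), nn.Linear(64, V)
        self.lstm = nn.LSTM(64, 64, batch_first=True)
    def forward(self, feats, words):
        hidden, _ = self.lstm(self.embed(words) + feats.unsqueeze(1))
        return self.out(hidden)

loss = joint_loss(backbone, attr_head, ToyCaptioner(),
                  torch.randn(B, 3, 32, 32),
                  torch.randint(0, 2, (B, A)).float(),
                  torch.randint(0, V, (B, T + 1)))
loss.backward()   # gradients from both losses flow into the shared backbone
```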

Further, we also collect two new datasets, named MS-COCO-ATT and Flickr30K-ATT. In essence, they are the MS COCO and Flickr30K datasets with an attribute list attached to each image in addition to its captions.

In summary, our key contributions are as follows: (1) Based on the MS COCO and Flickr30K datasets, we construct two datasets, i.e., MS-COCO-ATT and Flickr30K-ATT, where each image has an attribute list in addition to its caption data. (2) We propose a novel framework for attribute-based image captioning, which couples the attribute prediction task and the image representation extraction task more closely by sharing one base network. To the best of our knowledge, this is the first attribute-based image captioning system that does not rely on a standalone network such as an FCN to extract attributes. (3) We conduct experiments on two image caption benchmarks and demonstrate the effectiveness of the proposed framework. (4) We analyze the effect of different numbers of attributes on captioning performance and show that, as the attribute list grows longer, the captioning performance converges.

The rest of this paper is organized as follows. Section 2 gives a brief review of related work on image captioning. Section 3 introduces the details of our proposed method. Section 4 evaluates the proposed approach on two public benchmark datasets. Finally, Section 5 concludes the paper.


Related work

The problem of image captioning has recently received an upsurge of interest. For a comprehensive survey of image captioning methods, see [8]. In general, approaches to this problem can be broadly grouped into three categories. The first family of methods is template-based; see [9] for example. In this approach, a predefined template sentence is used, so the image captioning task is reduced to detecting labels for objects, relations, or the scene type of the given image.

The proposed approach

In this section, we describe the main components of our model in detail. An overview of the model is shown in Fig. 1. As illustrated, the whole model is composed of five components: the shared low-level CNN for image feature extraction, the high-level image feature re-encoding branch, the attribute prediction branch, the LSTM caption generator, and the attention model.
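
As a rough guide to how these five components might fit together, the skeleton below wires stand-in modules in the order just listed (shared CNN, re-encoding branch, attribute branch, attention, LSTM). It is a hypothetical sketch with placeholder layers, not the actual VD-SAN architecture; the real model uses DenseNet convolutional layers and proper attention modules rather than the simple projections shown here.

```python
import torch
import torch.nn as nn

class VDSANSkeleton(nn.Module):
    """Placeholder wiring of the five components (not the real architecture)."""
    def __init__(self, num_attrs=100, vocab_size=1000, dim=128):
        super().__init__()
        # (1) shared low-level CNN
        self.shared_cnn = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU())
        # (2) high-level image feature re-encoding branch
        self.reencode = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # (3) attribute prediction branch (multi-label)
        self.attr_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, num_attrs))
        # (4) attention models, reduced here to simple projections of the two cues
        self.vis_attn = nn.Linear(dim, dim)
        self.sem_attn = nn.Linear(num_attrs, dim)
        # (5) LSTM caption generator
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(3 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, images, captions):
        shared = self.shared_cnn(images)                      # shared low-level features
        img_feat = self.reencode(shared)                      # (B, dim) re-encoded image code
        attr_logits = self.attr_branch(shared)                # (B, num_attrs)
        vis_ctx = self.vis_attn(img_feat)                     # visual cue
        sem_ctx = self.sem_attn(torch.sigmoid(attr_logits))   # semantic (attribute) cue
        ctx = torch.cat([vis_ctx, sem_ctx], dim=1)            # (B, 2*dim)
        ctx = ctx.unsqueeze(1).expand(-1, captions.size(1), -1)
        hidden, _ = self.lstm(torch.cat([self.embed(captions), ctx], dim=2))
        return self.out(hidden), attr_logits                  # word logits, attribute logits

# Toy usage:
word_logits, attr_logits = VDSANSkeleton()(torch.randn(2, 3, 64, 64),
                                           torch.randint(0, 1000, (2, 10)))
```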

Datasets

To evaluate the effectiveness of our method, we conduct experiments on the two most popular and challenging image caption benchmarks: the Flickr30K [18] and MS COCO [19] datasets.

For the experiments on the MS COCO dataset, we first construct a new dataset named MS-COCO-ATT based on MS COCO. The original MS COCO [19] contains 82,783 training images and 40,504 validation images, each of which is annotated with 5 captions. In addition, it also includes an official test set of 40,775 images with

Conclusions

We have presented a novel image captioning model named VD-SAN, which employs explicit attributes to guide caption generation and unifies the attribute prediction task and the image captioning task more closely through a shared base network (DenseNet201-Conv). We achieve competitive results against state-of-the-art methods on the popular MS COCO caption benchmark and the Flickr30K benchmark. To train the attribute prediction network, we construct MS-COCO-ATT and Flickr30K-ATT based on MS

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments. This research was supported by the National Natural Science Foundation of China (No. 61573160).

Xinwei He received the B.S. degree in Electronics and Information Engineering from Hebei University of Science and Technology (HEBUST), Shijiazhuang, China in 2013. He is currently a Ph.D. student in the School of Electronic Information and Communications, HUST. His research interests include image captioning and 3D shape analysis.

References (26)

  • D. Amodei et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, Proceedings of the International Conference on Machine Learning (2016).
  • B. Shi et al., An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017).
  • K. Xu et al., Show, attend and tell: Neural image caption generation with visual attention, Proceedings of the ICML (2015).
  • J. Mao et al., Deep captioning with multimodal recurrent neural networks (m-RNN), Proceedings of the ICLR (2015).
  • X. Jia et al., Guiding the long-short term memory model for image caption generation, Proceedings of the IEEE International Conference on Computer Vision (2015).
  • X. He et al., Image caption generation with part of speech guidance, Pattern Recognit. Lett. (2017).
  • Q. You et al., Image captioning with semantic attention, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
  • R. Bernardi et al., Automatic description generation from images: A survey of models, datasets, and evaluation measures, J. Artif. Intell. Res. (2016).
  • A. Farhadi et al., Every picture tells a story: Generating sentences from images, Proceedings of the European Conference on Computer Vision (2010).
  • J. Devlin, S. Gupta, R. Girshick, M. Mitchell, C.L. Zitnick, Exploring nearest neighbor approaches for image...
  • O. Vinyals et al., Show and tell: A neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
  • C. Szegedy et al., Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
  • A. Karpathy et al., Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).


    Yang Yang received the B.S. degree in Electronics and Information Engineering from Huazhong University of Science and Technology (HUST), Wuhan, China in 2015. He is currently working toward the M.S. degree in the School of Electronic Information and Communications, HUST. His research interests include scene text detection, deep learning and its applications.

    Baoguang Shi received the B.S. degree in Electronics and Information Engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2012, where he is currently working toward the Ph.D. degree at the School of Electronic Information and Communications. His research interests include scene text detection and recognition, script identification and face alignment.

    Xiang Bai received the B.S., M.S., and Ph.D. degrees from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2003, 2005, and 2009, respectively, all in electronics and information engineering. He is currently a Professor with the School of Electronic Information and Communications, HUST. He is also the Vice-director of the National Center of Anti-Counterfeiting Technology, HUST. His research interests include object recognition, shape analysis, scene text recognition and intelligent systems.
