
Neurocomputing

Volume 328, 7 February 2019, Pages 48-55

VD-SAN: Visual-Densely Semantic Attention Network for Image Caption Generation

https://doi.org/10.1016/j.neucom.2018.02.106

Abstract

Recently, attributes have demonstrated their effectiveness in guiding image captioning systems. However, most attribute-based image captioning methods treat attribute prediction as a separate task and rely on a standalone stage to obtain the attributes for a given image, e.g., a separately pre-trained Fully Convolutional Network (FCN) is usually adopted. Inherently, they ignore the correlation between the attribute prediction task and the image representation extraction task, and at the same time increase the complexity of the image captioning system. In this paper, we aim to couple the attribute prediction stage and the image representation extraction stage tightly, and propose a novel and efficient image captioning framework called the Visual-Densely Semantic Attention Network (VD-SAN). In particular, the whole captioning system consists of shared convolutional layers from a Dense Convolutional Network (DenseNet), which are further split into a semantic attribute prediction branch and an image feature extraction branch, two semantic attention models, and a long short-term memory (LSTM) network for caption generation. To evaluate the proposed architecture, we construct the Flickr30K-ATT and MS-COCO-ATT datasets based on the original popular image caption datasets Flickr30K and MS COCO, respectively; each image in Flickr30K-ATT or MS-COCO-ATT is annotated with an attribute list in addition to its captions. Empirical results demonstrate that our captioning system achieves significant improvements over state-of-the-art approaches.

Introduction

Image captioning, which attempts to automatically generate captions (usually in English) that describe images or videos, has become a hot topic due to recent advances in both the computer vision and machine translation communities, especially in deep learning.

Recent work in this area adopts the well-known encoder-decoder framework. The encoder-decoder framework was originally introduced for the sequence-to-sequence problem in the machine translation community, where a recurrent neural network (RNN) is used to encode the source sentence and another RNN is employed to predict the target sentence. Such a framework has found applications in many areas, such as speech recognition [1] and text recognition [2]. The sequence-to-sequence problem is analogous in some respects to the image captioning problem. However, instead of translating a source sentence, image captioning aims at “translating” the visual input into a sentence. Therefore, a convolutional neural network (CNN) is adopted to encode the image.
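
To make the framework concrete, the sketch below shows a bare-bones CNN encoder and LSTM decoder in PyTorch. It only illustrates the general encoder-decoder idea, not the network used in this paper; the tiny convolutional stack, the layer sizes, and the way the image feature initializes the LSTM state are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy CNN encoder: image -> single feature vector (illustrative only)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (B, 64, 1, 1)
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, images):                # images: (B, 3, H, W)
        return self.fc(self.cnn(images).flatten(1))   # (B, feat_dim)

class TinyDecoder(nn.Module):
    """Toy LSTM decoder: the image feature initializes the LSTM state."""
    def __init__(self, vocab_size, feat_dim=256, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):       # captions: (B, T) word ids
        state = (self.init_h(feats).unsqueeze(0), self.init_c(feats).unsqueeze(0))
        hidden, _ = self.lstm(self.embed(captions), state)
        return self.out(hidden)               # (B, T, vocab) word logits

# Toy usage: predict word logits for a random "image" and caption prefix.
feats = TinyEncoder()(torch.randn(2, 3, 224, 224))
logits = TinyDecoder(vocab_size=10000)(feats, torch.randint(0, 10000, (2, 12)))
```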

Most work uses the fully connected layer output of a CNN as the encoded image representation, which is then fed into the top RNN language model to inform it of the image content, as in [4], [11]. On the other hand, Xu et al. [3] propose to use the convolutional-layer output of the CNN and apply a spatial attention mechanism over the convolutional features. Further, Jia et al. [5] propose to feed semantic information extracted from the image into the model so that the generated captions are tightly coupled to the image content. He et al. [6] propose to exploit Part-of-Speech (PoS) tags as a guidance signal for caption generation. More recently, many researchers have found that explicit visual attributes are also very effective in guiding the caption generation process.
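
The following is a minimal sketch of the soft spatial attention idea in the spirit of [3]: an additive scoring function compares each spatial location of the convolutional feature map with the current LSTM hidden state and produces a weighted context vector. The layer names and dimensions are assumptions for illustration; the attention models used in VD-SAN are defined later in the paper and may differ.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive attention over L spatial locations of a conv feature map."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats:  (B, L, feat_dim)  flattened conv map, L = H*W locations
        # hidden: (B, hidden_dim)   current LSTM hidden state
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e, dim=1)           # attention weights over locations
        context = (alpha * feats).sum(dim=1)      # (B, feat_dim) attended context
        return context, alpha.squeeze(-1)

# Toy usage: 49 locations (a 7x7 map) with 512-d features and a 512-d hidden state.
context, alpha = SoftAttention(512, 512)(torch.randn(2, 49, 512), torch.randn(2, 512))
```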

In this paper, we mainly focus on using explicit attributes to guide the image captioning model. The work that bears the most similarity to ours is [7]. In their work, a set of visual attributes is extracted for a given image and then fed into the RNN after applying an attention model over the predicted semantic attributes, which yields a large improvement. However, their method relies on an extra offline attribute extraction stage, e.g., a separately trained fully convolutional network (FCN). Moreover, a standalone CNN is adopted to extract the image representations, which further increases the complexity of their captioning system. In fact, the attribute prediction task and the captioning task are highly related and should not be treated as separate tasks. By sharing features between the attribute prediction sub-network and the image representation extraction sub-network, we can jointly optimize the two tasks with a multi-label classification loss and a cross-entropy loss applied to the long short-term memory (LSTM) network, respectively.
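
The sketch below illustrates this joint objective as just described: a shared backbone feeds an attribute head trained with a multi-label (binary cross-entropy) loss and a captioning head trained with a cross-entropy loss, and both losses are backpropagated through the shared layers. The module names (backbone, attr_head, captioner) and the equal loss weighting are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_loss(backbone, attr_head, captioner, images, attr_targets, captions):
    """Sum of a multi-label attribute loss and a caption cross-entropy loss,
    both computed from features of the same shared backbone."""
    shared = backbone(images)                         # shared image features
    attr_logits = attr_head(shared)                   # (B, num_attrs)
    cap_logits = captioner(shared, captions[:, :-1])  # (B, T, vocab), teacher forcing
    attr_loss = F.binary_cross_entropy_with_logits(attr_logits, attr_targets)
    cap_loss = F.cross_entropy(cap_logits.reshape(-1, cap_logits.size(-1)),
                               captions[:, 1:].reshape(-1))
    return attr_loss + cap_loss                       # equal weights, for illustration

# Toy usage with stand-in modules (any CNN backbone / LSTM captioner could be plugged in).
B, A, V, T = 2, 100, 1000, 12
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
attr_head = nn.Linear(64, A)

class ToyCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed, self.out = nn.Embedding(V, 64), nn.Linear(64, V)
        self.lstm = nn.LSTM(64, 64, batch_first=True)
    def forward(self, feats, words):
        hidden, _ = self.lstm(self.embed(words) + feats.unsqueeze(1))
        return self.out(hidden)

loss = joint_loss(backbone, attr_head, ToyCaptioner(),
                  torch.randn(B, 3, 32, 32),
                  torch.randint(0, 2, (B, A)).float(),
                  torch.randint(0, V, (B, T + 1)))
loss.backward()   # gradients from both losses flow into the shared backbone
```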

Further, we also collect two new datasets, named MS-COCO-ATT and Flickr30K-ATT. In essence, they are the MS COCO and Flickr30K datasets with an attribute list attached to each image in addition to its captions.

In summary, our key contributions are as follows: (1) Based on the MS COCO and Flickr30K datasets, we construct two datasets, i.e., MS-COCO-ATT and Flickr30K-ATT, where each image has an attribute list in addition to its caption data. (2) We propose a novel framework for attribute-based image captioning, which couples the attribute prediction task and the image representation extraction task more closely by sharing one base network. To the best of our knowledge, this is the first attribute-based image captioning system that does not rely on a standalone network such as an FCN to extract attributes. (3) We conduct experiments on two image caption benchmarks and demonstrate the effectiveness of the proposed framework. (4) We analyze the effect of different numbers of attributes on captioning performance and show that, as the attribute list grows longer, the captioning performance converges.

The rest of this paper is organized as follows. Section 2 gives a brief review of related work on image captioning. Section 3 introduces the details of our proposed method. Section 4 evaluates the proposed approach on two public benchmark datasets. Finally, Section 5 concludes the paper.


Related work

The problem of image captioning has recently received an upsurge of interest. For a comprehensive survey of image captioning methods, see [8]. In general, approaches to this problem can be broadly grouped into three categories. The first family of methods is template-based; see [9] for example. In this approach, a predefined template sentence is used, so the image captioning task is reduced to detecting labels for objects, relations, or the scene type of the given image.

The proposed approach

In this section, we describe the main components of our model in detail. An overview of the model is shown in Fig. 1. As illustrated, the whole model is composed of five components: the shared low-level CNN for image feature extraction, the high-level image feature re-encoding branch, the attribute prediction branch, the LSTM caption generator, and the attention model.
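
As a rough guide to how these five components might fit together, the skeleton below wires stand-in modules in the order just listed (shared CNN, re-encoding branch, attribute branch, attention, LSTM). It is a hypothetical sketch with placeholder layers, not the actual VD-SAN architecture; the real model uses DenseNet convolutional layers and proper attention modules rather than the simple projections shown here.

```python
import torch
import torch.nn as nn

class VDSANSkeleton(nn.Module):
    """Placeholder wiring of the five components (not the real architecture)."""
    def __init__(self, num_attrs=100, vocab_size=1000, dim=128):
        super().__init__()
        # (1) shared low-level CNN
        self.shared_cnn = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU())
        # (2) high-level image feature re-encoding branch
        self.reencode = nn.Sequential(
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # (3) attribute prediction branch (multi-label)
        self.attr_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, num_attrs))
        # (4) attention models, reduced here to simple projections of the two cues
        self.vis_attn = nn.Linear(dim, dim)
        self.sem_attn = nn.Linear(num_attrs, dim)
        # (5) LSTM caption generator
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(3 * dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, images, captions):
        shared = self.shared_cnn(images)                      # shared low-level features
        img_feat = self.reencode(shared)                      # (B, dim) re-encoded image code
        attr_logits = self.attr_branch(shared)                # (B, num_attrs)
        vis_ctx = self.vis_attn(img_feat)                     # visual cue
        sem_ctx = self.sem_attn(torch.sigmoid(attr_logits))   # semantic (attribute) cue
        ctx = torch.cat([vis_ctx, sem_ctx], dim=1)            # (B, 2*dim)
        ctx = ctx.unsqueeze(1).expand(-1, captions.size(1), -1)
        hidden, _ = self.lstm(torch.cat([self.embed(captions), ctx], dim=2))
        return self.out(hidden), attr_logits                  # word logits, attribute logits

# Toy usage:
word_logits, attr_logits = VDSANSkeleton()(torch.randn(2, 3, 64, 64),
                                           torch.randint(0, 1000, (2, 10)))
```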

Datasets

To evaluate the effectiveness of our method, we conduct experiments on the two most popular and challenging image caption benchmarks: the Flickr30K [18] and MS COCO [19] datasets.

For the experiments on the MS COCO dataset, we first construct a new dataset named MS-COCO-ATT based on MS COCO. The original MS COCO [19] contains 82,783 training images and 40,504 validation images, each of which is annotated with 5 captions. In addition, it also includes an official test set of 40,775 images with

Conclusions

We have presented a novel image captioning model named VD-SAN, which employs explicit attributes to guide caption generation and unifies the attribute prediction task and the image captioning task more closely through a shared base network (DenseNet201-Conv). We achieve competitive results against state-of-the-art methods on the popular MS COCO caption benchmark and the Flickr30K benchmark. To train the attribute prediction network, we construct MS-COCO-ATT and Flickr30K-ATT based on MS

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments. This research was supported by the National Natural Science Foundation of China (No. 61573160).

Xinwei He received the B.S. degree in Electronics and Information Engineering from Hebei University of Science and Technology (HEBUST), Shijiazhuang, China in 2013. He is currently a Ph.D. student in the School of Electronic Information and Communications, HUST. His research interests include image captioning and 3D shape analysis.

References (26)

  • D. Amodei et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, Proceedings of the International Conference on Machine Learning (2016).
  • B. Shi et al., An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2017).
  • K. Xu et al., Show, attend and tell: Neural image caption generation with visual attention, Proceedings of the ICML (2015).
  • J. Mao et al., Deep captioning with multimodal recurrent neural networks (m-RNN), Proceedings of the ICLR (2015).
  • X. Jia et al., Guiding the long-short term memory model for image caption generation, Proceedings of the IEEE International Conference on Computer Vision (2015).
  • X. He et al., Image caption generation with part of speech guidance, Pattern Recognit. Lett. (2017).
  • Q. You et al., Image captioning with semantic attention, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
  • R. Bernardi et al., Automatic description generation from images: A survey of models, datasets, and evaluation measures, J. Artif. Intell. Res. (2016).
  • A. Farhadi et al., Every picture tells a story: Generating sentences from images, Proceedings of the European Conference on Computer Vision (2010).
  • J. Devlin, S. Gupta, R. Girshick, M. Mitchell, C.L. Zitnick, Exploring nearest neighbor approaches for image...
  • O. Vinyals et al., Show and tell: A neural image caption generator, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
  • C. Szegedy et al., Going deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).
  • A. Karpathy et al., Deep visual-semantic alignments for generating image descriptions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015).


    Yang Yang received the B.S. degree in Electronics and Information Engineering from Huazhong University of Science and Technology (HUST), Wuhan, China in 2015. He is currently working toward the M.S. degree in the School of Electronic Information and Communications, HUST. His research interests include scene text detection, deep learning and its applications.

    Baoguang Shi received the B.S. degree in Electronics and Information Engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2012, where he is currently working toward the Ph.D. degree at the School of Electronic Information and Communications. His research interests include scene text detection and recognition, script identification and face alignment.

    Xiang Bai received the B.S., M.S., and Ph.D. degrees from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2003, 2005, and 2009, respectively, all in electronics and information engineering. He is currently a Professor with the School of Electronic Information and Communications, HUST. He is also the Vice-director of the National Center of Anti-Counterfeiting Technology, HUST. His research interests include object recognition, shape analysis, scene text recognition and intelligent systems.
