Elsevier

Neurocomputing

Volume 468, 11 January 2022, Pages 48-59

A visual persistence model for image captioning

https://doi.org/10.1016/j.neucom.2021.10.014

Abstract

Object-level features from Faster R-CNN and attention mechanisms have been used extensively in image captioning based on encoder-decoder frameworks. However, most existing methods feed the average pooling of object features to the captioning model as the global representation and recalculate the attention weights of object regions whenever a new word is generated, without modeling the visual persistence exhibited by humans. In this paper, we build Visual Persistence modules in both the encoder and the decoder: the module in the encoder seeks the core object features to replace the image global representation; the module in the decoder evaluates the correlation between the previous and current attention results and fuses them into the final attended feature used to generate a new word. Experimental results on MSCOCO validate the effectiveness and competitiveness of our Visual Persistence Model (VPNet). Remarkably, VPNet also achieves competitive scores on most metrics on the MSCOCO online test server compared with existing state-of-the-art methods.

Introduction

Image captioning aims to generate credible and fluent natural language descriptions for given images. It needs to ensure the correctness of the objects, attributes, and semantic information involved in the captions. The task lies at the intersection of Computer Vision and Natural Language Processing; it is challenging and has attracted increasing attention.

Recently, the encoder-decoder model with an attention mechanism has become the fundamental framework for image captioning and has achieved significant progress [40], [43], [2], [34], where an encoder based on a Convolutional Neural Network (CNN) extracts fixed-length image feature vectors, and a decoder based on a Recurrent Neural Network (RNN) generates the caption word by word.

Peter Anderson et al. used the object-detection results of Faster R-CNN as the feature vectors of the image and applied the attention mechanism to these object features [2]. Owing to its outstanding effect, the combination of visual attention and object features has been widely adopted in recent image captioning models [46], [34], [17], [20]. In these works, the attention mechanism connects the generated word with object regions at each time step, and attention layers at adjacent time steps interact only through the hidden states of the LSTM, without direct information exchange. However, M. Coltheart observed that humans continue to perceive a visual stimulus for a short period after it disappears, and the previously extracted information is retained as “iconic memory” for some time; this phenomenon is called “Visual Persistence” [9]. We therefore believe that there should be direct interaction between the attention layers at adjacent time steps to simulate this information retention. Besides, the average pooling of the object feature vectors is fed to the captioning model as the image global feature instead of deriving the core object features, which may leave out some simple but useful prior information.

To address the issues above, we propose a Visual Persistence Model (VPNet) for image captioning. In the encoder, we extract image object features following [2] and compute the average pooling of the object features as the image global representation. Different from previous works which directly feed the image global feature to the captioning model [2], [20], [17], [34], we first use the image global feature as the query to derive the main object feature from the raw object features with multi-head self-attention [38]. In the decoder, we replace the image global feature with this core object feature and feed it into the captioning model at each time step. Besides, when generating the t-th word, we design a visual persistence module that evaluates the relevance between the previous and current attention results under the current context information (e.g., the current LSTM hidden state), which guides the fusion of the previous and current attention results into the final visual attended feature used to generate the t-th word. In summary, we introduce “Visual Persistence” into the image captioning task, where the visual persistence modules in the encoder and decoder can be regarded as global and local “iconic memory”, respectively. In addition, we extend the self-attention module [38] from spatial attention to spatial-channel attention by adding extra channel-wise queries.
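The encoder-side module can be read as a single cross-attention step in which the average-pooled global feature queries the raw object features. The following PyTorch-style sketch illustrates this reading under our own assumptions about the module name, feature dimension, and head count; it is not the authors' released implementation.

```python
import torch
import torch.nn as nn


class VisualPersistenceEncoder(nn.Module):
    """Hypothetical sketch: let the average-pooled global feature attend over
    the raw object features to pick out a "core object feature"."""

    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, d_model), e.g. Faster R-CNN region features
        global_feat = obj_feats.mean(dim=1, keepdim=True)  # (batch, 1, d_model)
        # Global feature acts as the query; object features are keys and values.
        core_feat, _ = self.attn(global_feat, obj_feats, obj_feats)
        return core_feat.squeeze(1)  # (batch, d_model): the "core object feature"
```

In this sketch, the output vector plays the role that the plain average-pooled global representation plays in earlier models.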

We evaluate our model on the MSCOCO [27] “Karpathy” [19] test split and the online test server under cross-entropy loss and CIDEr optimization. Our proposed model outperforms or is competitive with existing state-of-the-art methods. On the MSCOCO “Karpathy” test split, a single VPNet achieves BLEU-1/BLEU-4/METEOR/ROUGE-L/CIDEr/SPICE scores of 78.6/38.2/28.6/58.0/121.0/21.9 with cross-entropy loss and 80.9/39.7/29.3/59.2/130.4/23.2 with CIDEr optimization. On the MSCOCO online test server, compared with works officially published in recent years, an ensemble of 4 VPNet models ranks in the top three on all metrics.

Our main contributions are as follows: (1) We propose Visual Persistence modules in the encoder and decoder that simulate the “iconic memory” retention of human “Visual Persistence”, which can be regarded as introducing more semantic and interactional information; (2) We extend the self-attention block from spatial-wise to spatial-channel-wise attention, which aims to capture more internal interactional information and further enhance the attended feature (a sketch is given below); (3) We conduct extensive experiments on the MSCOCO dataset, which demonstrate the effectiveness of our proposed model.
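For contribution (2), one plausible way to add channel-wise queries to a standard self-attention block is to run a second attention pass over the transposed (channel-by-region) view of the features and fuse the two branches. The sketch below is only an illustrative assumption (including the fixed region count), not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class SpatialChannelSelfAttention(nn.Module):
    """Hypothetical sketch: a spatial self-attention branch over object regions
    plus a channel branch whose queries are the feature channels themselves."""

    def __init__(self, d_model: int = 1024, n_regions: int = 36, n_heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Channel branch attends over the transposed (channel x region) view;
        # it assumes a fixed number of regions per image.
        self.channel = nn.MultiheadAttention(n_regions, 1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_regions, d_model)
        s, _ = self.spatial(x, x, x)          # spatial-wise self-attention
        xt = x.transpose(1, 2)                # (batch, d_model, n_regions)
        c, _ = self.channel(xt, xt, xt)       # channel-wise self-attention
        return s + c.transpose(1, 2)          # fuse the two branches
```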

The rest of this paper is organized as follows. Section 2 introduces related work on image captioning. Section 3 presents the details of VPNet, including its structure and the adopted objective functions. Section 4 describes the experimental design, compares performance with existing state-of-the-art models, and visualizes the attention regions during caption generation. Section 5 concludes the paper.

Section snippets

Related work

Image Captioning. Inspired by the successful application of the encoder-decoder framework in Machine Translation [5], [8], Oriol Vinyals et al. first applied the encoder-decoder framework to image captioning, where the image is encoded as a fixed-length feature vector by a pre-trained CNN and then decoded into a description word by word using a Long Short-Term Memory network (LSTM) [15]; Kelvin Xu et al. first introduced the attention mechanism into image captioning, which can dynamically focus on the salient
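A schematic sketch of this classic encode-then-decode pipeline follows; module names, dimensions, and the greedy decoding loop are illustrative assumptions rather than any particular paper's code.

```python
import torch
import torch.nn as nn


class CNNLSTMCaptioner(nn.Module):
    """Hypothetical sketch of the classic captioner: a fixed-length image
    vector conditions an LSTM that emits one word per step."""

    def __init__(self, feat_dim: int = 2048, embed_dim: int = 512,
                 hidden_dim: int = 512, vocab_size: int = 10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)   # CNN feature -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def greedy_decode(self, img_feat, bos_id: int = 1, max_len: int = 20):
        # img_feat: (batch, feat_dim) from a pre-trained CNN
        batch = img_feat.size(0)
        h = torch.zeros(batch, self.lstm.hidden_size)
        c = torch.zeros(batch, self.lstm.hidden_size)
        # Feed the projected image feature as the first input.
        h, c = self.lstm(self.img_proj(img_feat), (h, c))
        word = torch.full((batch,), bos_id, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)            # greedy word choice
            caption.append(word)
        return torch.stack(caption, dim=1)               # (batch, max_len)
```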

Model

The overall architecture of our Visual Persistence model (VPNet) is shown in Fig. 1. The model adopts the widely used encoder-decoder framework.

Given an image I, we first obtain the object region features with a pre-trained Faster R-CNN [35]. Then the encoder refines the features and derives the feature of the main object region by Visual Persistence (En). Finally, the decoder generates the caption word by word, where Visual Persistence (De) fuses the previous time step's attended feature att_{t-1} and
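A gating-style fusion is one natural realization of the Visual Persistence (De) step above: the previous attended feature, the current attended feature, and the current LSTM hidden state jointly decide how much of the earlier attention result is retained. The code below is a hedged sketch of such a gate, with names and dimensions chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn


class VisualPersistenceDecoder(nn.Module):
    """Hypothetical sketch: fuse the previous and current attended features,
    with the mixing gate predicted from both features and the LSTM hidden state."""

    def __init__(self, d_model: int = 1024, d_hidden: int = 1024):
        super().__init__()
        self.gate = nn.Linear(2 * d_model + d_hidden, d_model)

    def forward(self, att_prev, att_curr, h_t):
        # att_prev, att_curr: (batch, d_model); h_t: (batch, d_hidden)
        g = torch.sigmoid(self.gate(torch.cat([att_prev, att_curr, h_t], dim=-1)))
        # Learned element-wise trade-off: how much "iconic memory" to keep.
        return g * att_prev + (1.0 - g) * att_curr
```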

Dataset and evaluation metrics

We train and evaluate our proposed model on the MSCOCO 2014 dataset [27], which contains a total of 123,287 images (82,783 for training and 40,504 for validation), each with 5 reference captions; besides, MSCOCO also provides 40,775 images for online testing. In this paper, we use the “Karpathy” split [19] to re-divide MSCOCO 2014, with 113,287 images for training, 5,000 images for validation, and 5,000 images for offline evaluation. We convert all captions in the “Karpathy” training set to
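For reference, the “Karpathy” split is commonly distributed as a single JSON file (often named dataset_coco.json) with a per-image split label; the snippet below sketches how the 113,287/5,000/5,000 partition described above can be reproduced from it. The file and field names follow the widely used public release and are assumptions here, not details taken from this paper.

```python
import json
from collections import defaultdict

# Path and field names follow the commonly shared Karpathy split file
# (dataset_coco.json); adjust to your local copy.
with open("dataset_coco.json") as f:
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    # 'restval' images are conventionally folded into training,
    # yielding 113,287 train / 5,000 val / 5,000 test images.
    split = "train" if img["split"] in ("train", "restval") else img["split"]
    splits[split].append(img)

print({name: len(imgs) for name, imgs in splits.items()})
```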

Conclusion and future work

In this paper, we propose a Visual Persistence Model (VPNet) based on encoder-decoder architecture to simulate the “iconic memory” retention in human “Visual Persistence”. In the encoder part, we use the image global representation as the query to seek the main object regions from object features. In the decoder part, we evaluate the correlation between the previous attention results and current attention results, and fuse them as the final attended feature to generate a new word. Besides, we

CRediT authorship contribution statement

Yiyu Wang: Conceptualization, Methodology, Software, Writing – original draft. Jungang Xu: Conceptualization, Methodology, Resources, Supervision, Writing – review & editing. Yingfei Sun: Conceptualization, Methodology, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (50)

  • Anderson, P., Fernando, B., Johnson, M., Gould, S., 2016. SPICE: semantic propositional image caption evaluation, in:...
  • Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L., 2018. Bottom-up and top-down attention...
  • Ba, J., et al. Multiple object recognition with visual attention.
  • Ba, L.J., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:...
  • Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural machine translation by jointly learning to align and translate, in:...
  • Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N., 2015. Scheduled sampling for sequence prediction with recurrent...
  • Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., Chua, T., 2017. SCA-CNN: spatial and channel-wise attention...
  • Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning phrase...
  • Coltheart, M., 1980. The persistences of vision. Philosophical Transactions of the Royal Society of London B, Biological Sciences.
  • Corbetta, M., et al., 2002. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience.
  • Cornia, M., Stefanini, M., Baraldi, L., Cucchiara, R., 2020. Meshed-memory transformer for image captioning, in:...
  • Dauphin, Y.N., Fan, A., Auli, M., Grangier, D., 2017. Language modeling with gated convolutional networks, in:...
  • He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the CVPR,...
  • Herdade, S., Kappeler, A., Boakye, K., Soares, J., 2019. Image captioning: Transforming objects into words, in:...
  • Hochreiter, S., et al., 1997. Long short-term memory. Neural Computation.
  • Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks, in: Proceedings of the CVPR, pp. 7132–7141....
  • Huang, L., Wang, W., Chen, J., Wei, X., 2019. Attention on attention for image captioning, in: Proceedings of the ICCV,...
  • Jiang, W., Ma, L., Jiang, Y., Liu, W., Zhang, T., 2018. Recurrent fusion network for image captioning, in: Proceedings...
  • Karpathy, A., et al., 2017. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell.
  • Ke, L., Pei, W., Li, R., Shen, X., Tai, Y., 2019. Reflective decoding network for image captioning, in: Proceedings of...
  • Kingma, D.P., et al. Adam: A method for stochastic optimization.
  • Kipf, T.N., et al. Semi-supervised classification with graph convolutional networks.
  • Krishna, R., et al., 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis.
  • Lavie, A., Agarwal, A., 2007. METEOR: an automatic metric for MT evaluation with high levels of correlation with human...
  • Li, G., Zhu, L., Liu, P., Yang, Y., 2019. Entangled transformer for image captioning, in: Proceedings of the ICCV, pp....

    Yiyu Wang is a Ph.D. student in School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences. He received his B.S. degree in Computer Science and Technology from Northwest Agricultural and Forestry University in 2018. His research interest is computer vision.

    Jungang Xu is a full professor in School of Computer Science and Technology, University of Chinese Academy of Sciences. He received his Ph.D. degree in Computer Applied Technology from University of Chinese Academy of Sciences in 2003. His current research interests are computer vision and automated machine learning.

    Yingfei Sun is a full professor in School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences. He received his Ph.D. degree in Applied Mathematics from Beijing Institute of Technology in 1999. His current research interests are machine learning and pattern recognition.
