A visual persistence model for image captioning
Introduction
Image captioning aims to generate accurate and fluent natural language descriptions for given images, which requires ensuring the correctness of the objects, attributes, and semantic information involved in the captions. The task connects Computer Vision and Natural Language Processing; it is challenging and has attracted increasing attention.
Recently, the encoder-decoder model with an attention mechanism has become the fundamental framework of image captioning and has achieved significant progress [40], [43], [2], [34]: the encoder, based on a Convolutional Neural Network (CNN), extracts fixed-length image feature vectors, and the decoder, based on a Recurrent Neural Network (RNN), generates captions word by word.
Peter Anderson et al. used object detection results from Faster R-CNN as the image feature vectors and applied the attention mechanism to these object features [2]. Due to their outstanding effect, the visual attention mechanism and object features have been widely adopted in recent image captioning models [46], [34], [17], [20]. In these works, the attention mechanism connects the generated word with object regions at each time step, but attention layers at adjacent time steps interact only through the hidden states of the LSTM, without direct information exchange. However, M. Coltheart observed that humans continue to perceive a visual stimulus for a period of time after it disappears, and the previously extracted information is retained for some time as "iconic memory"; this phenomenon is called "Visual Persistence" [9]. Therefore, we believe there should be direct interaction between the attention layers of adjacent time steps to simulate this information retention. Besides, the average pooling of the object feature vectors is fed to the captioning model as the global image feature instead of deriving the core object features, which may leave out some simple but useful prior information.
To address the issues above, we propose a Visual Persistence Model (VPNet) for image captioning. In the Encoder, we extract image object features following [2] and compute the average pooling of the object features as the global image representation. Different from previous works which directly feed the global image feature to the captioning model [2], [20], [17], [34], we first use it as query information to derive the main object feature from the raw object features with multi-head self-attention [38]. In the Decoder, we replace the global image feature with this core object feature and feed it into the captioning model at each time step. Besides, when generating the t-th word, we design a visual persistence module that evaluates the relevance between the previous and current attention results given the current context information (e.g. the current LSTM hidden state), which guides the fusion of the previous and current attention results into the final attended visual feature for generating the t-th word. In summary, we introduce "Visual Persistence" into the image captioning task, where the visual persistence modules in the encoder and decoder can be regarded as global and local "iconic memory". In addition, we extend the self-attention module [38] from spatial attention to spatial-channel attention by adding extra channel-wise queries.
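The decoder-side fusion described above can be sketched as follows. This is an illustrative NumPy sketch under our own simplifying assumptions (a single sigmoid gate conditioned on the LSTM hidden state, with hypothetical parameters W_g, b_g), not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def visual_persistence_fuse(prev_att, curr_att, hidden, W_g, b_g):
    """Fuse previous and current attended features with a relevance gate
    computed from the current LSTM hidden state (illustrative only).

    prev_att, curr_att: (d,) attended visual features at steps t-1 and t
    hidden:             (h,) current LSTM hidden state (context information)
    W_g: (h, 1), b_g: (1,)  hypothetical gate parameters
    """
    # Relevance of the previous attention result to the current context
    g = sigmoid(hidden @ W_g + b_g)          # gate value in (0, 1)
    # Convex combination: keep a fraction of the "iconic memory"
    return g * prev_att + (1.0 - g) * curr_att

rng = np.random.default_rng(0)
d, h = 4, 6
prev_att = rng.normal(size=d)
curr_att = rng.normal(size=d)
hidden = rng.normal(size=h)
fused = visual_persistence_fuse(prev_att, curr_att, hidden,
                                rng.normal(size=(h, 1)), np.zeros(1))
print(fused.shape)  # (4,)
```

Because the gate yields a convex combination, the fused feature always lies between the previous and current attention results, which matches the intuition of gradually decaying iconic memory.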
We evaluate our model on the MSCOCO [27] "Karpathy" [19] test split and the online test server under cross-entropy loss and CIDEr optimization. Our proposed model is competitive with, and often outperforms, existing state-of-the-art methods. On the MSCOCO "Karpathy" test split, a single VPNet achieves BLEU-1/BLEU-4/METEOR/ROUGE-L/CIDEr/SPICE scores of 78.6/38.2/28.6/58.0/121.0/21.9 with cross-entropy loss and 80.9/39.7/29.3/59.2/130.4/23.2 with CIDEr optimization. On the MSCOCO online test server, compared with the works officially published in recent years, an ensemble of 4 VPNet models achieves top-3 performance on all metrics.
Our main contributions are as follows: (1) We propose Visual Persistence modules in the Encoder and Decoder that simulate the "iconic memory" retention of human "Visual Persistence", which can be regarded as introducing more semantic and interactional information; (2) We extend the self-attention block from spatial-wise to spatial-channel-wise, aiming to capture more internal interactional information and further enhance the attended feature; (3) We conduct extensive experiments on the MSCOCO dataset, which demonstrate the effectiveness of our proposed model.
The rest of this paper is organized as follows. Section 2 introduces related work on image captioning. Section 3 presents the details of VPNet, including its structure and the adopted objective functions. Section 4 covers the experimental design, performance comparisons with existing state-of-the-art models, and a visualization analysis of attention regions in the caption generation process. Section 5 concludes the paper.
Related work
Image Captioning. Inspired by the successful application of the encoder-decoder framework in Machine Translation [5], [8], Oriol Vinyals et al. first applied the encoder-decoder framework to image captioning, where the image is encoded as a fixed-length feature vector by a pre-trained CNN and then decoded into a description word by word with a Long Short-Term Memory network (LSTM) [15]; Kelvin Xu et al. first introduced the attention mechanism into image captioning, which can dynamically focus on the salient regions of the image when generating each word.
Model
The overall architecture of our Visual Persistence model (VPNet) is shown in Fig. 1. The model adopts the widely used encoder-decoder framework.
Given an image I, we first obtain the object region features with a pre-trained Faster R-CNN [35]. Then the encoder refines the features and derives the feature of the main object region with Visual Persistence (En). Finally, the Decoder generates the caption word by word, where Visual Persistence (De) fuses the attended feature of the previous time step with that of the current time step to produce the final visual feature for word generation.
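The encoder-side step above, using the mean-pooled global feature as the query over object features, can be sketched minimally as follows. This is a single-head, NumPy-based illustration for brevity, whereas the paper uses multi-head self-attention; function and variable names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def main_object_feature(obj_feats):
    """Derive a 'main object' feature from N object features of shape (N, d):
    the mean-pooled global feature attends over the object regions."""
    n, d = obj_feats.shape
    query = obj_feats.mean(axis=0)            # global image feature, (d,)
    scores = obj_feats @ query / np.sqrt(d)   # (N,) scaled attention logits
    weights = softmax(scores)                 # attention distribution over regions
    return weights @ obj_feats                # (d,) attended main object feature

# e.g. 36 Faster R-CNN region features of dimension 2048
feats = np.random.default_rng(1).normal(size=(36, 2048))
core = main_object_feature(feats)
print(core.shape)  # (2048,)
```

The resulting feature replaces plain average pooling as the global visual input to the decoder at each time step.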
Dataset and evaluation metrics
We train and evaluate our proposed model on the MSCOCO 2014 dataset [27], which contains a total of 123287 images (82783 for training and 40504 for validation), each with 5 reference captions; besides, MSCOCO also provides 40775 images for online testing. In this paper, we use the "Karpathy" split [19] to repartition MSCOCO 2014 into 113287 images for training, 5000 for validation and 5000 for offline evaluation. We convert all captions in the "Karpathy" training set to lower case.
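As a quick sanity check on the numbers above, the "Karpathy" split simply repartitions the official train+val images:

```python
# MSCOCO 2014 image counts as reported above
train, val = 82783, 40504
karpathy = {"train": 113287, "val": 5000, "test": 5000}

# The Karpathy split covers exactly the official train+val pool
assert sum(karpathy.values()) == train + val
print(sum(karpathy.values()))  # 123287
```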
Conclusion and future work
In this paper, we propose a Visual Persistence Model (VPNet) based on the encoder-decoder architecture to simulate the "iconic memory" retention of human "Visual Persistence". In the encoder part, we use the global image representation as the query to seek the main object regions from the object features. In the decoder part, we evaluate the correlation between the previous and current attention results and fuse them as the final attended feature to generate a new word. Besides, we extend the self-attention module from spatial attention to spatial-channel attention.
CRediT authorship contribution statement
Yiyu Wang: Conceptualization, Methodology, Software, Writing – original draft. Jungang Xu: Conceptualization, Methodology, Resources, Supervision, Writing – review & editing. Yingfei Sun: Conceptualization, Methodology, Validation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Yiyu Wang is a Ph.D. student in School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences. He received his B.S. degree in Computer Science and Technology from Northwest Agricultural and Forestry University in 2018. His research interest is computer vision.
References (50)
- Anderson, P., Fernando, B., Johnson, M., Gould, S., 2016. SPICE: semantic propositional image caption evaluation, in:...
- Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L., 2018. Bottom-up and top-down attention...
- Ba, J., Mnih, V., Kavukcuoglu, K., 2015. Multiple object recognition with visual attention, in: ICLR.
- Ba, L.J., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:...
- Bahdanau, D., Cho, K., Bengio, Y., 2015. Neural machine translation by jointly learning to align and translate, in:...
- Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N., 2015. Scheduled sampling for sequence prediction with recurrent...
- Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., Chua, T., 2017. SCA-CNN: spatial and channel-wise attention...
- Cho, K., van Merrienboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning phrase...
- Coltheart, M., 1980. The persistences of vision. Philosophical Transactions of the Royal Society of London B, Biological Sciences.
- Corbetta, M., Shulman, G.L., 2002. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience.
- Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Computation.
- Karpathy, A., Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell.
- Kingma, D.P., Ba, J., 2015. Adam: A method for stochastic optimization, in: ICLR.
- Kipf, T.N., Welling, M., 2017. Semi-supervised classification with graph convolutional networks, in: ICLR.
- Krishna, R., et al., 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis.
Jungang Xu is a full professor in School of Computer Science and Technology, University of Chinese Academy of Sciences. He received his Ph.D. degree in Computer Applied Technology from University of Chinese Academy of Sciences in 2003. His current research interests are computer vision and automated machine learning.
Yingfei Sun is a full professor in School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences. He received his Ph.D. degree in Applied Mathematics from Beijing Institute of Technology in 1999. His current research interests are machine learning and pattern recognition.