Contrastive semantic similarity learning for image captioning evaluation
Introduction
Generating proper descriptions for images has gained much attention from computer vision and natural language processing researchers, and since neural networks began to flourish, the research community has witnessed a proliferation of neural image captioning models. The earliest neural captioning methods date back to [39], [21], which are based on sequential modeling with Recurrent Neural Networks [31]. With the advent of the attention mechanism, the visual attention model [45] and the attributes attention model [47] were proposed. The evaluation metrics for image captioning, however, have remained comparatively under-explored. Conventional captioning metrics include BLEU [33], METEOR [5], ROUGE [23], CIDEr [38] and SPICE [3], all of which compute an alignment between candidate captions and ground-truth sentences. Automatic evaluation metrics for image captioning face several key challenges. First, current metrics tend to deviate from human judgments: metrics based on token overlap miss the semantic similarity among sentences. For one thing, different expressions can convey the same meaning; for another, the same word can carry different meanings, which conventional metrics struggle to account for. Second, rule-based metrics have various blind spots. For example, the SPICE metric captures visual information through scene graphs but handles sentence structural information less well. To the best of our knowledge, sentence-level embedding is currently less explored than token-level embedding. In addition, as shown in Fig. 1, intrinsic variances exist between the ground-truth sentences and the candidate caption.
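To make the token-overlap problem concrete, the following minimal, hypothetical sketch (not from the paper) shows how a simple unigram-overlap score penalizes a paraphrase that a human would judge as equivalent to the reference:

```python
# Minimal sketch: why pure token overlap can miss valid paraphrases.
def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    cand, ref = candidate.lower().split(), set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand)

reference = "a man is riding a bicycle down the street"
paraphrase = "a person cycles along the road"   # same meaning, few shared tokens
print(unigram_overlap(paraphrase, reference))   # ~0.33 despite equivalent semantics
```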
To address the challenges mentioned above, we propose a learning-based metric that captures the intrinsic information entailed among different sentences. We use an auto-encoder to recover the input sentence from itself: the input is first transformed into a vector representation by an encoder and then reconstructed by a text decoder. In image captioning research, prior work has shown that diverse semantics can be obtained by sampling from the latent space of captioning models [41], [29], [30]. Inspired by these observations, we propose the Intrinsic Image Captioning Evaluation metric.
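A minimal sketch of such a sentence auto-encoder is given below, assuming a GRU-based encoder and decoder over a toy vocabulary; the paper's exact architecture, tokenizer, and hyperparameters are not specified here:

```python
import torch
import torch.nn as nn

class SentenceAutoEncoder(nn.Module):
    """Encode a sentence to a vector, then decode the sentence back from it."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hid_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, tokens):                    # tokens: (B, T) int ids
        _, h = self.encoder(self.embed(tokens))  # final hidden state: (1, B, hid_dim)
        return h.squeeze(0)                      # sentence embedding: (B, hid_dim)

    def forward(self, tokens):
        z = self.encode(tokens)                  # latent sentence vector
        # teacher forcing: decode the input sentence from its own embedding
        dec, _ = self.decoder(self.embed(tokens), z.unsqueeze(0))
        return self.out(dec), z                  # logits (B, T, V) and embedding

model = SentenceAutoEncoder(vocab_size=10000)
tokens = torch.randint(0, 10000, (4, 12))        # a toy batch of token ids
logits, z = model(tokens)
# next-token reconstruction loss over the shifted sequence
recon_loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 10000), tokens[:, 1:].reshape(-1))
```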
Specifically, we train the sentence auto-encoder on MSCOCO caption annotations. We regularize the original sentence reconstruction loss with a semantic distance loss term to learn a semantic distance-aware model, so that the model learns to distinguish different meanings in the latent embedding space. Overall, the main contributions of this paper include:
- We propose the Intrinsic Image Captioning Evaluation metric, a self-supervised learning method based on the auto-encoding mechanism and contrastive semantic learning. Conventional metrics typically rely on token-level matching, which may lose sentence-level information; in contrast, the proposed method computes semantic similarity from sentence-level embeddings.
- We explicitly add a semantic loss term to the overall training objective to make the learned representation semantically distance-aware. Furthermore, by forming the training corpus in the manner of contrastive learning, we develop both dual-branch and triple-branch model structures for learning semantic similarity (see the loss sketch after this list).
- We perform an empirical study on our collected dataset. The results show that the proposed metric agrees with human judgments more closely than both conventional and learning-based metrics. We then conduct extensive experiments on the Composite-COCO and PASCAL-50S datasets, which further validate the effectiveness of the proposed metric. In addition, we evaluate several state-of-the-art image captioning models on the MSCOCO dataset with both contemporary metrics and the proposed one. Qualitative results further show that the proposed metric produces scores that adapt to, and correlate strongly with, the semantics of the test captions.
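The second contribution combines reconstruction with a semantic distance term. The following is a minimal sketch of such a combined objective, assuming the triple-branch variant takes the triplet-margin form common in contrastive learning; the paper's exact loss is not reproduced here:

```python
import torch
import torch.nn.functional as F

def training_loss(recon_logits, targets, z_anchor, z_pos, z_neg,
                  margin: float = 0.2, alpha: float = 1.0):
    """Reconstruction loss regularized by a triplet-style semantic distance term."""
    recon = F.cross_entropy(recon_logits.reshape(-1, recon_logits.size(-1)),
                            targets.reshape(-1))
    # pull embeddings of semantically matched captions together,
    # push mismatched ones at least `margin` further apart
    d_pos = 1 - F.cosine_similarity(z_anchor, z_pos)   # (B,)
    d_neg = 1 - F.cosine_similarity(z_anchor, z_neg)   # (B,)
    semantic = F.relu(d_pos - d_neg + margin).mean()
    return recon + alpha * semantic
```

A dual-branch variant would use only an anchor-positive (or anchor-negative) pair with a plain distance penalty instead of the triplet margin.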
The remainder of this paper is organized as follows. Section 2 includes related work, and Section 3 illustrates our method. The experimental results are then presented in Section 4. Finally, Section 5 concludes the paper.
Section snippets
Related Work
This work mainly relates to automatic image captioning and, in particular, its evaluation. In terms of matching strategy, contemporary image captioning evaluation metrics can be categorized as rule-based or learning-based. According to the information used for matching, they can also be classified into image-guided and text-guided metrics. Our proposed metric is text-guided and learning-based.
Approach
This paper aims to obtain a self-supervised, learning-based image captioning evaluation metric that can score captions without relying on human-labeled scores. To achieve this goal, we train an auto-encoder to extract the gist embedded in a caption, so that captions with similar semantic meanings are mapped to neighboring areas of the latent space while captions with different meanings are located apart. Our work mainly contains two components: a self-supervised auto-encoder that learns sentence embeddings, and a contrastive semantic learning scheme that makes those embeddings distance-aware.
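Once the auto-encoder is trained, a candidate caption can be scored against the ground-truth captions directly in the embedding space. A minimal sketch follows, assuming cosine similarity with max-aggregation over the references; the paper's exact aggregation rule may differ:

```python
import torch
import torch.nn.functional as F

def caption_score(z_candidate: torch.Tensor, z_references: torch.Tensor) -> float:
    """Score a candidate caption embedding (D,) against the embeddings of the
    ground-truth captions (N, D) by their highest cosine similarity."""
    sims = F.cosine_similarity(z_candidate.unsqueeze(0), z_references, dim=1)
    return sims.max().item()
```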
Experiments
In this section, we carry out several experiments to validate the effectiveness of the proposed method for evaluating the quality of image captions. Our motivation is to obtain intrinsic, semantic distance-aware sentence-level embeddings for image caption evaluation. In the following subsections, we first introduce the experimental settings and then present the results.
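For reference, consistency between a metric and human judgments is typically quantified with rank correlation. The sketch below uses hypothetical scores; the paper's exact protocol (correlation type, dataset splits) may differ:

```python
from scipy.stats import kendalltau, spearmanr

metric_scores = [0.81, 0.35, 0.62, 0.90]   # hypothetical metric outputs
human_scores  = [4.5, 2.0, 3.5, 5.0]       # hypothetical human ratings
tau, _ = kendalltau(metric_scores, human_scores)
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Kendall tau={tau:.3f}, Spearman rho={rho:.3f}")
```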
Conclusion
In this paper, we introduce the proposed metric, which uses intrinsic sentence vectors to calculate similarity instead of matching n-gram tokens or word chunks. The metric benefits from sentence-intrinsic information and gains an understanding of semantic similarity at the sentence level. To make the intrinsic vectors more distance-aware, we further develop two variations of the original single-branch approach that add an explicit distance loss. We conduct several experiments whose results validate the consistency of the proposed metric with human judgments.
CRediT authorship contribution statement
Chao Zeng: Investigation, Conceptualization, Methodology, Software, Data curation, Writing - original draft. Sam Kwong: Conceptualization, Resources, Writing - review & editing, Supervision, Funding acquisition. Tiesong Zhao: Resources, Writing - review & editing. Hanli Wang: Formal analysis, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the Key Project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).
References (50)
- et al., Multitask learning approach for understanding the relationship between two sentences, Information Sciences (2019)
- et al., Attention pooling-based convolutional neural network for sentence modelling, Information Sciences (2016)
- et al., A novel lifelong learning model based on cross domain knowledge extraction and transfer to classify underwater images, Information Sciences (2021)
- et al., Knowledge extraction and retention based continual learning by using convolutional autoencoder-based learning classifier system, Information Sciences (2022)
- et al., Paraphrase thought: Sentence embedding module imitating human language recognition, Information Sciences (2020)
- et al., CNN-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances, Information Sciences (2021)
- et al., Single image super-resolution using multi-scale deep encoder–decoder with phase congruency edge map guidance, Information Sciences (2019)
- et al., A robust generative classifier against transfer attacks based on variational auto-encoders, Information Sciences (2021)
- Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y., 2015. From images to sentences through scene description...
- Agarwal, P., Betancourt, A., Panagiotou, V., Díaz-Rodríguez, N., 2020. Egoshots, an ego-vision life-logging dataset and...
- SPICE: Semantic propositional image caption evaluation
- Bottom-up and top-down attention for image captioning and visual question answering
- Meshed-memory transformer for image captioning
- Learning to evaluate image captioning
- Attention on attention for image captioning
- Deep visual-semantic alignments for generating image descriptions
- ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out
Cited by (7)
- CA-Captioner: A novel concentrated attention for image captioning (2024, Expert Systems with Applications)
- Contrastive fine-tuning for low-resource graph-level transfer learning (2024, Information Sciences)
- DHCF: Dual disentangled-view hierarchical contrastive learning for fake news detection on social media (2023, Information Sciences)
- Randomly shuffled convolution for self-supervised representation learning (2023, Information Sciences)

  Citation excerpt: "However, these methods have limited performance compared to supervised methods. Contrastive learning, which is a branch of self-supervised learning, has achieved competitive performance compared to the supervised methods [5–8,10,18]. Contrastive methods for visual representations [7,8] aim to minimize the distance between the representations of two different augmented views from one image and push apart the representations of augmented views of different images."