Contrastive semantic similarity learning for image captioning evaluation
Introduction
Generating proper descriptions for images has gained much attention from computer vision and natural language processing researchers, and since neural networks began to flourish, the research community has witnessed a proliferation of neural image captioning models. The earliest neural captioning methods date back to [39], [21], which are based on sequential modeling with Recurrent Neural Networks [31]. With the advent of the attention mechanism, the visual attention model [45] and the attributes attention model [47] were proposed. The evaluation metrics for image captioning, however, have remained comparatively under-explored. Conventional captioning metrics include BLEU [33], METEOR [5], ROUGE [23], CIDEr [38] and SPICE [3], all of which compute an alignment between candidate captions and ground-truth sentences. Automatic evaluation metrics for image captioning face several key challenges. First, current metrics tend to deviate from human judgments: metrics based on token overlap miss the semantic similarity among sentences. For one thing, different expressions can convey the same meaning; for another, the same word can carry different meanings, which conventional metrics struggle to account for. Second, rule-based metrics have various blind spots. For example, the SPICE metric captures visual information through scene graphs but handles sentence structural information less well. To the best of our knowledge, sentence-level embedding is currently less explored than token-level embedding. In addition, as shown in Fig. 1, intrinsic variances exist between the ground-truth sentences and the candidate caption.
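To make the token-overlap problem concrete, the following minimal, hypothetical sketch (not from the paper) shows how a simple unigram-overlap score penalizes a paraphrase that a human would judge as equivalent to the reference:

```python
# Minimal sketch: why pure token overlap can miss valid paraphrases.
def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference."""
    cand, ref = candidate.lower().split(), set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand)

reference = "a man is riding a bicycle down the street"
paraphrase = "a person cycles along the road"   # same meaning, few shared tokens
print(unigram_overlap(paraphrase, reference))   # ~0.33 despite equivalent semantics
```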
To address the challenges mentioned above, we propose a learning-based metric that captures the intrinsic information entailed among different sentences. We use an auto-encoder to recover the input sentence from itself: the input is first transformed into a vector representation by an encoder and then reconstructed by a text decoder. In image captioning research, prior work has shown that diverse semantics can be obtained by sampling from the latent space of captioning models [41], [29], [30]. Inspired by these observations, we propose the Intrinsic Image Captioning Evaluation metric.
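A minimal sketch of such a sentence auto-encoder is given below, assuming a GRU-based encoder and decoder over a toy vocabulary; the paper's exact architecture, tokenizer, and hyperparameters are not specified here:

```python
import torch
import torch.nn as nn

class SentenceAutoEncoder(nn.Module):
    """Encode a sentence to a vector, then decode the sentence back from it."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hid_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def encode(self, tokens):                    # tokens: (B, T) int ids
        _, h = self.encoder(self.embed(tokens))  # final hidden state: (1, B, hid_dim)
        return h.squeeze(0)                      # sentence embedding: (B, hid_dim)

    def forward(self, tokens):
        z = self.encode(tokens)                  # latent sentence vector
        # teacher forcing: decode the input sentence from its own embedding
        dec, _ = self.decoder(self.embed(tokens), z.unsqueeze(0))
        return self.out(dec), z                  # logits (B, T, V) and embedding

model = SentenceAutoEncoder(vocab_size=10000)
tokens = torch.randint(0, 10000, (4, 12))        # a toy batch of token ids
logits, z = model(tokens)
# next-token reconstruction loss over the shifted sequence
recon_loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 10000), tokens[:, 1:].reshape(-1))
```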
Specifically, we train the sentence auto-encoder on MSCOCO caption annotations. We regularize the original sentence reconstruction loss with a semantic distance loss term to learn a semantic distance-aware model, so that the model learns to distinguish different meanings in the latent embedding space. Overall, the main contributions of this paper include:
- We propose the Intrinsic Image Captioning Evaluation metric, a self-supervised learning method based on the auto-encoding mechanism and contrastive semantic learning. Conventional metrics typically rely on token-level matching, which may lose sentence-level information; in contrast, the proposed method computes semantic similarity from sentence-level embeddings.
- We explicitly add a semantic loss term to the overall training objective to make the learned representation semantically distance-aware. Furthermore, by forming the training corpus in the manner of contrastive learning, we develop both dual-branch and triple-branch model structures for learning semantic similarity (see the loss sketch after this list).
- We perform an empirical study on our collected dataset. The results show that the proposed metric agrees with human judgments more closely than both conventional and learning-based metrics. We then conduct extensive experiments on the Composite-COCO and PASCAL-50S datasets, which further validate the effectiveness of the proposed metric. In addition, we evaluate several state-of-the-art image captioning models on the MSCOCO dataset with both contemporary metrics and the proposed one. Qualitative results further show that the proposed metric produces scores that adapt to, and correlate strongly with, the semantics of the test captions.
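The second contribution combines reconstruction with a semantic distance term. The following is a minimal sketch of such a combined objective, assuming the triple-branch variant takes the triplet-margin form common in contrastive learning; the paper's exact loss is not reproduced here:

```python
import torch
import torch.nn.functional as F

def training_loss(recon_logits, targets, z_anchor, z_pos, z_neg,
                  margin: float = 0.2, alpha: float = 1.0):
    """Reconstruction loss regularized by a triplet-style semantic distance term."""
    recon = F.cross_entropy(recon_logits.reshape(-1, recon_logits.size(-1)),
                            targets.reshape(-1))
    # pull embeddings of semantically matched captions together,
    # push mismatched ones at least `margin` further apart
    d_pos = 1 - F.cosine_similarity(z_anchor, z_pos)   # (B,)
    d_neg = 1 - F.cosine_similarity(z_anchor, z_neg)   # (B,)
    semantic = F.relu(d_pos - d_neg + margin).mean()
    return recon + alpha * semantic
```

A dual-branch variant would use only an anchor-positive (or anchor-negative) pair with a plain distance penalty instead of the triplet margin.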
The remainder of this paper is organized as follows. Section 2 includes related work, and Section 3 illustrates our method. The experimental results are then presented in Section 4. Finally, Section 5 concludes the paper.
Section snippets
Related Work
This work mainly relates to automatic image captioning and, in particular, its evaluation. In terms of matching strategy, contemporary image captioning evaluation metrics can be categorized as rule-based or learning-based. According to the information used for matching, they can also be classified into image-guided and text-guided metrics. Our proposed metric is text-guided and learning-based.
Approach
This paper aims to obtain a self-supervised, learning-based image captioning evaluation metric that can score captions without relying on human-labeled scores. To achieve this goal, we train an auto-encoder to extract the gist embedded in a caption, so that captions with similar semantic meanings are mapped to neighboring areas of the latent space while captions with different meanings are located apart. Our work mainly contains two components: a self-supervised auto-encoder that learns sentence embeddings, and a contrastive semantic learning scheme that makes those embeddings distance-aware.
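Once the auto-encoder is trained, a candidate caption can be scored against the ground-truth captions directly in the embedding space. A minimal sketch follows, assuming cosine similarity with max-aggregation over the references; the paper's exact aggregation rule may differ:

```python
import torch
import torch.nn.functional as F

def caption_score(z_candidate: torch.Tensor, z_references: torch.Tensor) -> float:
    """Score a candidate caption embedding (D,) against the embeddings of the
    ground-truth captions (N, D) by their highest cosine similarity."""
    sims = F.cosine_similarity(z_candidate.unsqueeze(0), z_references, dim=1)
    return sims.max().item()
```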
Experiments
In this section, we carry out several experiments to validate the effectiveness of the proposed method for evaluating the quality of image captions. Our motivation is to obtain intrinsic, semantic distance-aware sentence-level embeddings for image caption evaluation. In the following subsections, we first introduce the experimental settings and then present the results.
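For reference, consistency between a metric and human judgments is typically quantified with rank correlation. The sketch below uses hypothetical scores; the paper's exact protocol (correlation type, dataset splits) may differ:

```python
from scipy.stats import kendalltau, spearmanr

metric_scores = [0.81, 0.35, 0.62, 0.90]   # hypothetical metric outputs
human_scores  = [4.5, 2.0, 3.5, 5.0]       # hypothetical human ratings
tau, _ = kendalltau(metric_scores, human_scores)
rho, _ = spearmanr(metric_scores, human_scores)
print(f"Kendall tau={tau:.3f}, Spearman rho={rho:.3f}")
```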
Conclusion
In this paper, we introduce the proposed metric, which uses intrinsic sentence vectors to calculate similarity instead of matching n-gram tokens or word chunks. The metric benefits from sentence-intrinsic information and gains an understanding of semantic similarity at the sentence level. To make the intrinsic vectors more distance-aware, we further develop two variations of the original single-branch approach that add an explicit distance loss. We conduct several experiments whose results validate the consistency of the proposed metric with human judgments.
CRediT authorship contribution statement
Chao Zeng: Investigation, Conceptualization, Methodology, Software, Data curation, Writing - original draft. Sam Kwong: Conceptualization, Resources, Writing - review & editing, Supervision, Funding acquisition. Tiesong Zhao: Resources, Writing - review & editing. Hanli Wang: Formal analysis, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the Key Project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China (Grant No. 2018AAA0101301), by the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA), and in part by the Hong Kong GRF-RGC General Research Fund under Grant 11209819 (CityU 9042816) and Grant 11203820 (9042598).
References (50)
- et al., Multitask learning approach for understanding the relationship between two sentences, Information Sciences (2019)
- et al., Attention pooling-based convolutional neural network for sentence modelling, Information Sciences (2016)
- et al., A novel lifelong learning model based on cross domain knowledge extraction and transfer to classify underwater images, Information Sciences (2021)
- et al., Knowledge extraction and retention based continual learning by using convolutional autoencoder-based learning classifier system, Information Sciences (2022)
- et al., Paraphrase thought: Sentence embedding module imitating human language recognition, Information Sciences (2020)
- et al., CNN-based encoder-decoder networks for salient object detection: A comprehensive review and recent advances, Information Sciences (2021)
- et al., Single image super-resolution using multi-scale deep encoder–decoder with phase congruency edge map guidance, Information Sciences (2019)
- et al., A robust generative classifier against transfer attacks based on variational auto-encoders, Information Sciences (2021)
- Aditya, S., Yang, Y., Baral, C., Fermuller, C., Aloimonos, Y., 2015. From images to sentences through scene description...
- Agarwal, P., Betancourt, A., Panagiotou, V., Díaz-Rodríguez, N., 2020. Egoshots, an ego-vision life-logging dataset and...
- SPICE: Semantic propositional image caption evaluation
- Bottom-up and top-down attention for image captioning and visual question answering
- Meshed-memory transformer for image captioning
- Learning to evaluate image captioning
- Attention on attention for image captioning
- Deep visual-semantic alignments for generating image descriptions
- ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out
Cited by (7)
- CA-Captioner: A novel concentrated attention for image captioning (2024, Expert Systems with Applications)
- Contrastive fine-tuning for low-resource graph-level transfer learning (2024, Information Sciences)
- DHCF: Dual disentangled-view hierarchical contrastive learning for fake news detection on social media (2023, Information Sciences)
- Randomly shuffled convolution for self-supervised representation learning (2023, Information Sciences)

  Citation excerpt: "However, these methods have limited performance compared to supervised methods. Contrastive learning, which is a branch of self-supervised learning, has achieved competitive performance compared to the supervised methods [5–8,10,18]. Contrastive methods for visual representations [7,8] aim to minimize the distance between the representations of two different augmented views from one image and push apart the representations of augmented views of different images."