DOI: 10.1145/3511808.3557382
Research Article · Public Access

Look Twice as Much as You Say: Scene Graph Contrastive Learning for Self-Supervised Image Caption Generation

Published: 17 October 2022

Abstract

Images are commonly used in various information and knowledge applications, such as advertising and recommendation. Automating image caption generation would significantly improve image accessibility. This cross-modal task, which takes an image as input and produces text as output, is however difficult to learn. Although prior methods achieve good performance on image caption generation, they rely either on supervised learning, which requires sufficient labeled data, or on unsupervised learning, which needs an external dataset as a language pivot. In this paper, we propose SGCL, a novel Scene Graph Contrastive Learning model for self-supervised image caption generation. SGCL adopts the pre-training and fine-tuning pipeline. Specifically, we first apply scene graph generation and object detection methods to encode the scene graph and visual information of the image as feature representations. A decoder network based on a graph attention network and a recurrent neural network is then designed to generate sequential text as the caption. To enable contrastive learning in SGCL, we design scene graph augmentations as contrastive views of images and train the model effectively without ground-truth labels through contrastive learning. Additionally, we introduce pre-trained word embeddings and a context projector to enrich the text representation in the decoder network, which benefits model pre-training. Once the pre-training phase is finished, we further fine-tune the model for the image caption generation task with limited labeled data. Extensive experiments on a benchmark dataset demonstrate that SGCL outperforms state-of-the-art models (both supervised and unsupervised).
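To make the contrastive pre-training idea described above more concrete, the following is a minimal PyTorch sketch of graph contrastive learning over a scene graph: two augmented views of the graph are encoded by a small graph attention encoder and pulled together with an InfoNCE-style (NT-Xent) loss. All names (SceneGraphEncoder, drop_edges, nt_xent) and the edge-dropping augmentation are illustrative assumptions, not the authors' implementation; SGCL's actual encoder, scene graph augmentations, and caption decoder are described in the full paper.

```python
# Minimal sketch of scene-graph contrastive pre-training.
# Assumptions: plain PyTorch, hypothetical names; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneGraphEncoder(nn.Module):
    """Single-head graph attention layer followed by mean pooling."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim)
        self.attn = nn.Linear(2 * hid_dim, 1)

    def forward(self, x, adj):
        # x: (N, in_dim) node features, adj: (N, N) 0/1 adjacency matrix
        h = self.proj(x)                                     # (N, hid)
        n = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = torch.tanh(self.attn(pair)).squeeze(-1)          # (N, N) raw scores
        e = e.masked_fill(adj == 0, float('-inf'))           # keep only edges
        a = torch.nan_to_num(torch.softmax(e, dim=-1))       # isolated nodes -> 0
        h = a @ h                                            # attention aggregation
        return h.mean(dim=0)                                 # graph-level embedding

def drop_edges(adj, p=0.2):
    """Example augmentation: randomly remove a fraction p of edges."""
    mask = (torch.rand_like(adj.float()) > p).float()
    return adj.float() * mask

def nt_xent(z1, z2, tau=0.5):
    """InfoNCE / NT-Xent loss between two batches of graph embeddings."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                               # (B, B) similarities
    labels = torch.arange(z1.size(0), device=z1.device)      # positives on diagonal
    return F.cross_entropy(logits, labels)

# Usage sketch on a toy graph: two augmented views of one scene graph.
enc = SceneGraphEncoder(in_dim=300, hid_dim=128)
x, adj = torch.randn(5, 300), (torch.rand(5, 5) > 0.5).float()
z1 = enc(x, drop_edges(adj)).unsqueeze(0)                    # view 1
z2 = enc(x, drop_edges(adj)).unsqueeze(0)                    # view 2
loss = nt_xent(z1, z2)
```

In practice, a batch of scene graphs (one per image) would supply in-batch negatives for the loss, and the caption decoder would only be attached and fine-tuned on the limited labeled captions after this pre-training stage.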


Cited By

  • (2024) "Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation." In Proceedings of the 6th ACM International Conference on Multimedia in Asia, 1-8. https://doi.org/10.1145/3696409.3700223
  • (2024) "Breaking the Trilemma of Privacy, Utility, and Efficiency via Controllable Machine Unlearning." In Proceedings of the ACM Web Conference 2024, 1260-1271. https://doi.org/10.1145/3589334.3645669
  • (2024) "Symbolic Prompt Tuning Completes the App Promotion Graph." In Machine Learning and Knowledge Discovery in Databases: Applied Data Science Track, 183-198. https://doi.org/10.1007/978-3-031-70381-2_12

Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022, 5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong

Publisher

Association for Computing Machinery, New York, NY, United States

      Author Tags

      1. graph contrastive learning
      2. image caption generation

      Qualifiers

      • Research-article

      Conference

      CIKM '22

      Acceptance Rates

CIKM '22 paper acceptance rate: 621 of 2,257 submissions (28%)
Overall acceptance rate: 1,861 of 8,427 submissions (22%)
