DOI: 10.1145/3460426.3463637

Neural Symbolic Representation Learning for Image Captioning

Published: 01 September 2021

Abstract

Traditional image captioning models rely on a single encoder-decoder architecture to generate one natural-language sentence for a given image. Such an architecture typically uses deep neural networks to extract neural representations of the image while ignoring the abstract concepts, and the relationships among them, conveyed in the image. To comprehensively characterize the image content and bridge the gap between neural representations and high-level abstract concepts, we make the first attempt to investigate neural symbolic representations of images for the image captioning task. We first parse a given image into a neural symbolic representation in the form of an attributed relational graph, whose nodes denote abstract concepts and whose branches indicate the relationships between connected nodes. By performing computations over this graph, the neural symbolic representation evolves step by step: the node and branch representations, as well as their corresponding importance weights, are updated at each step. Extensive experiments validate the effectiveness of the proposed method. Integrating the neural representation with the neural symbolic representation enables a more comprehensive understanding of the given image, achieving state-of-the-art results on both the MSCOCO and Flickr30k datasets. Moreover, the proposed neural symbolic representation generalizes better to other domains, yielding significant performance improvements over existing methods on the cross-domain image captioning task.
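To make the graph computation described above concrete, the sketch below shows one plausible update step over an attributed relational graph: node (concept) and branch (relationship) embeddings are refined by passing messages along the branches, and importance weights over the nodes are recomputed with a simple attention score. This is a minimal illustration rather than the authors' implementation; the parameter names (W_msg, W_node, W_edge, w_att), the mean aggregation, and the tanh nonlinearity are assumptions introduced here.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_step(nodes, edges, adjacency, W_msg, W_node, W_edge, w_att):
    """One illustrative update over an attributed relational graph.

    nodes:     (N, d) node (concept) embeddings
    edges:     (N, N, d) branch (relationship) embeddings
    adjacency: (N, N) binary matrix, 1 where a branch connects node i to node j
    W_msg, W_node: (2d, d) projections; W_edge: (3d, d); w_att: (d,)
    """
    N, _ = nodes.shape
    new_nodes = nodes.copy()
    for i in range(N):
        neighbours = np.nonzero(adjacency[i])[0]
        if len(neighbours) == 0:
            continue
        # Each neighbour sends a message through the branch that connects it.
        msgs = [np.tanh(np.concatenate([nodes[j], edges[i, j]]) @ W_msg)
                for j in neighbours]
        agg = np.mean(msgs, axis=0)
        new_nodes[i] = np.tanh(np.concatenate([nodes[i], agg]) @ W_node)
    # Branch representations are refreshed from their (updated) endpoints.
    new_edges = edges.copy()
    for i, j in zip(*np.nonzero(adjacency)):
        new_edges[i, j] = np.tanh(
            np.concatenate([new_nodes[i], edges[i, j], new_nodes[j]]) @ W_edge)
    # Importance weights over nodes, e.g. to feed an attention-based decoder.
    node_weights = softmax(new_nodes @ w_att)
    return new_nodes, new_edges, node_weights

# Toy usage with random parameters (shapes are illustrative only).
rng = np.random.default_rng(0)
N, d = 4, 8
nodes = rng.normal(size=(N, d))
edges = rng.normal(size=(N, N, d))
adjacency = (rng.random((N, N)) > 0.5).astype(int)
np.fill_diagonal(adjacency, 0)
W_msg, W_node = rng.normal(size=(2 * d, d)), rng.normal(size=(2 * d, d))
W_edge, w_att = rng.normal(size=(3 * d, d)), rng.normal(size=d)
nodes, edges, weights = graph_step(nodes, edges, adjacency,
                                   W_msg, W_node, W_edge, w_att)
```

Repeating such a step lets the graph representation evolve alongside caption generation, with the node weights indicating which concepts matter most at each step.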




    Information & Contributors


    Published In

    ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
    August 2021
    715 pages
    ISBN: 9781450384636
    DOI: 10.1145/3460426


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. image captioning
    2. neural symbolic representation

    Qualifiers

    • Research-article

    Funding Sources

    • Shanghai Municipal Science and Technology Major Project

    Conference

    ICMR '21

    Acceptance Rates

    Overall Acceptance Rate 88 of 241 submissions, 37%

    Contributors

    X. Wang, L. Ma, Y. Fu, X. Xue

    Article Metrics

    • Downloads (Last 12 months): 28
    • Downloads (Last 6 weeks): 5

    Reflects downloads up to 27 Feb 2025

    Cited By

    • (2025) Semantic Abstractions for Multi-label Classification. Artificial Intelligence Logic and Applications, 143-151. DOI: 10.1007/978-981-96-0354-1_12. Online publication date: 31-Jan-2025.
    • (2024) NeuSyRE: Neuro-symbolic visual understanding and reasoning framework based on scene graph enrichment. Semantic Web 15:4, 1389-1413. DOI: 10.3233/SW-233510. Online publication date: 4-Oct-2024.
    • (2023) PerceptSent - Exploring Subjectivity in a Novel Dataset for Visual Sentiment Analysis. IEEE Transactions on Affective Computing 14:3, 1817-1831. DOI: 10.1109/TAFFC.2022.3225238. Online publication date: 1-Jul-2023.
    • (2023) Complementary Shifted Transformer for Image Captioning. Neural Processing Letters 55:6, 8339-8363. DOI: 10.1007/s11063-023-11314-0. Online publication date: 10-Jun-2023.
    • (2022) Hadamard Product Perceptron Attention for Image Captioning. Neural Processing Letters 55:3, 2707-2724. DOI: 10.1007/s11063-022-10980-w. Online publication date: 28-Jul-2022.
    • (2022) Enhance understanding and reasoning ability for image captioning. Applied Intelligence 53:3, 2706-2722. DOI: 10.1007/s10489-022-03624-y. Online publication date: 12-May-2022.
