DOI: 10.1145/3474085.3475712 · ACM Conferences · MM Conference Proceedings
research-article

Scene Graph with 3D Information for Change Captioning

Published: 17 October 2021 Publication History

Abstract

Change captioning aims to describe the differences between an image pair in natural language. It is an interesting but under-explored task with two main challenges: correctly describing the relative positions of objects and overcoming disturbances caused by viewpoint changes. To address these issues, we propose a three-dimensional (3D) information aware Scene Graph based Change Captioning (SGCC) model. We extract the semantic attributes of objects and the 3D information of the images (i.e., the depths of objects, relative two-dimensional image-plane distances, and relative angles between objects) to construct a scene graph for each image, then aggregate the node representations with a graph convolutional network. Owing to the scene graphs and the relative position relationships they encode, our model can help observers locate changed objects quickly and is immune to viewpoint changes to some extent. Extensive experiments show that SGCC achieves performance competitive with state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, verifying the effectiveness of our proposed model. Code is available at https://github.com/VISLANG-Lab/SGCC.
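The aggregation step the abstract describes can be pictured as a standard graph-convolution layer applied to nodes whose features mix semantic attributes with 3D cues. The following is a minimal sketch under stated assumptions: the feature layout (a small attribute embedding plus depth, image-plane distance, and angle), the toy graph, and all sizes are illustrative, not the authors' actual implementation.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One Kipf-Welling-style GCN layer: H = ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Toy scene graph with 3 objects; edges link spatially related objects.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])

# Hypothetical node features:
# [attribute embedding (2-d), depth, relative distance, relative angle]
X = np.array([[0.9, 0.1, 2.5, 0.0, 0.00],
              [0.2, 0.8, 3.1, 1.2, 0.52],
              [0.5, 0.5, 4.0, 2.4, 1.05]])

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))                  # learnable weights (random here)

H = gcn_layer(A, X, W)
print(H.shape)                               # → (3, 4): aggregated node features
```

Each output row is an object representation that has absorbed its neighbors' attributes and 3D cues; a captioning decoder could then compare such representations across the image pair.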




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. change captioning
  2. image difference description
  3. scene graph

Funding Sources

  • the Science and Technology Planning Project of Guangdong Province
  • the Science and Technology Programs of Guangzhou
  • an internal research grant from the Hong Kong Polytechnic University, China
  • National Natural Science Foundation of China
  • the collaborative research grants from the Fundamental Research Funds for the Central Universities, SCUT
  • the Hong Kong Research Grants Council, China

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months)91
  • Downloads (Last 6 weeks)5
Reflects downloads up to 20 Jan 2025


Cited By

View all
  • (2024) A survey of neurosymbolic visual reasoning with scene graphs and common sense knowledge. Neurosymbolic Artificial Intelligence, 1-24. DOI: 10.3233/NAI-240719. Online publication date: 13-May-2024
  • (2024) Region-Focused Network for Dense Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6, 1-20. DOI: 10.1145/3648370. Online publication date: 26-Mar-2024
  • (2024) Intelli-Change Remote Sensing - A Novel Transformer Approach. 2024 Second International Conference on Data Science and Information System (ICDSIS), 1-7. DOI: 10.1109/ICDSIS61070.2024.10594026. Online publication date: 17-May-2024
  • (2024) Distractors-Immune Representation Learning with Cross-Modal Contrastive Regularization for Change Captioning. Computer Vision – ECCV 2024, 311-328. DOI: 10.1007/978-3-031-72775-7_18. Online publication date: 30-Sep-2024
  • (2023) Cross-modal representation learning and generation. Journal of Image and Graphics 28:6, 1608-1629. DOI: 10.11834/jig.230035. Online publication date: 2023
  • (2023) Improving Scene Graph Generation with Superpixel-Based Interaction Learning. Proceedings of the 31st ACM International Conference on Multimedia, 1809-1820. DOI: 10.1145/3581783.3611889. Online publication date: 26-Oct-2023
  • (2023) Neighborhood Contrastive Transformer for Change Captioning. IEEE Transactions on Multimedia 25, 9518-9529. DOI: 10.1109/TMM.2023.3254162. Online publication date: 29-Mar-2023
  • (2023) I3N: Intra- and Inter-Representation Interaction Network for Change Captioning. IEEE Transactions on Multimedia 25, 8828-8841. DOI: 10.1109/TMM.2023.3242142. Online publication date: 1-Jan-2023
  • (2023) Depth-Aware and Semantic Guided Relational Attention Network for Visual Question Answering. IEEE Transactions on Multimedia 25, 5344-5357. DOI: 10.1109/TMM.2022.3190686. Online publication date: 1-Jan-2023
  • (2023) 3D VSG: Long-term Semantic Scene Change Prediction through 3D Variable Scene Graphs. 2023 IEEE International Conference on Robotics and Automation (ICRA), 8179-8186. DOI: 10.1109/ICRA48891.2023.10161212. Online publication date: 29-May-2023
  • Show More Cited By
