DOI: 10.1145/3474085.3475712 · ACM Conferences · MM Conference Proceedings
research-article

Scene Graph with 3D Information for Change Captioning

Published: 17 October 2021 Publication History

Abstract

Change captioning aims to describe the differences between an image pair in natural language. It is an interesting but under-explored task with two main challenges: correctly describing the relative positions of objects and overcoming disturbances caused by viewpoint changes. To address these issues, we propose a three-dimensional (3D) information aware Scene Graph based Change Captioning (SGCC) model. We extract the semantic attributes of objects and the 3D information of the images (i.e., the depths of objects, relative two-dimensional image-plane distances, and relative angles between objects) to construct a scene graph for each image, then aggregate the node representations with a graph convolutional network. Owing to the scene graphs and the relative position relationships they encode, our model can help observers locate changed objects quickly and is immune to viewpoint changes to some extent. Extensive experiments show that SGCC achieves performance competitive with state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, verifying the effectiveness of our proposed model. Code is available at https://github.com/VISLANG-Lab/SGCC.
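The aggregation step the abstract describes can be pictured as a standard graph-convolution layer applied to nodes whose features mix semantic attributes with 3D cues. The following is a minimal sketch under stated assumptions: the feature layout (a small attribute embedding plus depth, image-plane distance, and angle), the toy graph, and all sizes are illustrative, not the authors' actual implementation.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One Kipf-Welling-style GCN layer: H = ReLU(D^-1/2 (A+I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)

# Toy scene graph with 3 objects; edges link spatially related objects.
A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])

# Hypothetical node features:
# [attribute embedding (2-d), depth, relative distance, relative angle]
X = np.array([[0.9, 0.1, 2.5, 0.0, 0.00],
              [0.2, 0.8, 3.1, 1.2, 0.52],
              [0.5, 0.5, 4.0, 2.4, 1.05]])

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 4))                  # learnable weights (random here)

H = gcn_layer(A, X, W)
print(H.shape)                               # → (3, 4): aggregated node features
```

Each output row is an object representation that has absorbed its neighbors' attributes and 3D cues; a captioning decoder could then compare such representations across the image pair.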




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. change captioning
  2. image difference description
  3. scene graph

Funding Sources

  • the Science and Technology Planning Project of Guangdong Province
  • the Science and Technology Programs of Guangzhou
  • an internal research grant from the Hong Kong Polytechnic University, China
  • National Natural Science Foundation of China
  • the collaborative research grants from the Fundamental Research Funds for the Central Universities, SCUT
  • the Hong Kong Research Grants Council, China

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months)91
  • Downloads (Last 6 weeks)5
Reflects downloads up to 20 Jan 2025


Cited By

View all
  • (2024) A survey of neurosymbolic visual reasoning with scene graphs and common sense knowledge. Neurosymbolic Artificial Intelligence, 1-24. DOI: 10.3233/NAI-240719. Online publication date: 13-May-2024
  • (2024) Region-Focused Network for Dense Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6, 1-20. DOI: 10.1145/3648370. Online publication date: 26-Mar-2024
  • (2024) Intelli-Change Remote Sensing - A Novel Transformer Approach. 2024 Second International Conference on Data Science and Information System (ICDSIS), 1-7. DOI: 10.1109/ICDSIS61070.2024.10594026. Online publication date: 17-May-2024
  • (2024) Distractors-Immune Representation Learning with Cross-Modal Contrastive Regularization for Change Captioning. Computer Vision – ECCV 2024, 311-328. DOI: 10.1007/978-3-031-72775-7_18. Online publication date: 30-Sep-2024
  • (2023) Cross-modal representation learning and generation. Journal of Image and Graphics 28:6, 1608-1629. DOI: 10.11834/jig.230035. Online publication date: 2023
  • (2023) Improving Scene Graph Generation with Superpixel-Based Interaction Learning. Proceedings of the 31st ACM International Conference on Multimedia, 1809-1820. DOI: 10.1145/3581783.3611889. Online publication date: 26-Oct-2023
  • (2023) Neighborhood Contrastive Transformer for Change Captioning. IEEE Transactions on Multimedia 25, 9518-9529. DOI: 10.1109/TMM.2023.3254162. Online publication date: 29-Mar-2023
  • (2023) I3N: Intra- and Inter-Representation Interaction Network for Change Captioning. IEEE Transactions on Multimedia 25, 8828-8841. DOI: 10.1109/TMM.2023.3242142. Online publication date: 1-Jan-2023
  • (2023) Depth-Aware and Semantic Guided Relational Attention Network for Visual Question Answering. IEEE Transactions on Multimedia 25, 5344-5357. DOI: 10.1109/TMM.2022.3190686. Online publication date: 1-Jan-2023
  • (2023) 3D VSG: Long-term Semantic Scene Change Prediction through 3D Variable Scene Graphs. 2023 IEEE International Conference on Robotics and Automation (ICRA), 8179-8186. DOI: 10.1109/ICRA48891.2023.10161212. Online publication date: 29-May-2023
  • Show More Cited By
