Research Article
DOI: 10.1145/3581783.3612210

Beware of Overcorrection: Scene-induced Commonsense Graph for Scene Graph Generation

Published: 27 October 2023

Abstract

Scene graph generation is heavily constrained by class imbalance. Previous methods alleviate this problem by incorporating commonsense information into the classification, enabling the prediction model to rectify an incorrect head class into the correct tail class. However, commonsense-based models typically overcorrect, e.g., a visually correct head class is forcibly modified into a wrong tail class. We argue that there are two principal reasons for this phenomenon. First, existing models ignore the semantic gap between commonsense knowledge and real scenes. Second, current commonsense fusion strategies propagate information only between neighbors in the visual-linguistic context, without modeling long-range correlation. To alleviate overcorrection, we formulate commonsense-based scene graph generation as two sub-problems: scene-induced commonsense graph generation (SI-CGG) and commonsense-inspired scene graph generation (CI-SGG). In the SI-CGG module, unlike conventional methods that use a fixed commonsense graph, we adaptively adjust the node embeddings of the commonsense graph according to their visual appearance and configure new reasoning edges under the specific visual context. The CI-SGG module propagates information from the scene-induced commonsense graph back to the scene graph: it updates the representation of each scene graph node by aggregating neighborhood information at different scales. Through maximum likelihood optimization of the logarithmic Gaussian process, the scene graph automatically adapts to different neighbors in the visual-linguistic context. Systematic experiments on the Visual Genome dataset show that our full method achieves state-of-the-art performance.
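
The abstract describes the CI-SGG update as an aggregation of neighborhood information at several scales with scale-dependent weighting. The paper page includes no code, so the following PyTorch fragment is only a minimal sketch of that general idea under stated assumptions; the class name MultiScaleAggregator, the number of scales, the scale-specific projections, and the residual update are all illustrative choices, not the authors' implementation.

```python
# Hypothetical sketch of multi-scale neighborhood aggregation, loosely
# inspired by the CI-SGG description above; NOT the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleAggregator(nn.Module):
    """Aggregates node features over 1..K-hop neighborhoods and mixes
    the scales with learned (softmax-normalized) weights."""

    def __init__(self, dim: int, num_scales: int = 3):
        super().__init__()
        self.num_scales = num_scales
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_scales)])
        self.scale_logits = nn.Parameter(torch.zeros(num_scales))

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; adj: (N, N) dense 0/1 adjacency matrix.
        # Row-normalize the adjacency so each hop is an averaging step.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        a_norm = adj / deg

        hops, h = [], x
        for k in range(self.num_scales):
            h = a_norm @ h                   # one more hop of propagation
            hops.append(self.proj[k](h))     # scale-specific projection

        weights = F.softmax(self.scale_logits, dim=0)
        mixed = sum(w * hk for w, hk in zip(weights, hops))
        return x + mixed                     # residual update of node states


if __name__ == "__main__":
    # Toy usage: 5 scene-graph nodes with 16-d features on a chain graph.
    x = torch.randn(5, 16)
    adj = torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
    out = MultiScaleAggregator(dim=16, num_scales=3)(x, adj)
    print(out.shape)  # torch.Size([5, 16])
```

In this sketch the learned softmax weights decide how much each hop distance contributes, which is one simple way a model could adapt to different neighborhoods rather than treating all scales equally.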

Supplemental Material

MP4 File
Presentation video



    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN:9798400701085
    DOI:10.1145/3581783
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2023

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. overcorrection
    2. scene graph generation
    3. scene-induced commonsense graph
    4. visual-linguistic context

Conference
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%
