research-article

Memory-Based Network for Scene Graph with Unbalanced Relations

Authors:

Yang Chen

MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Pages 2400 - 2408

https://doi.org/10.1145/3394171.3413507

Published: 12 October 2020

Abstract

A scene graph, which can be represented as a set of visual triples, is composed of objects and the relations between object pairs. It is vital for image captioning, visual question answering, and many other applications. However, relations in scene graph datasets follow a long-tail distribution, and tail relations cannot be accurately identified because training samples are lacking. Nonstandard labels and feature overlap in scene graphs further hinder the extraction of discriminative features and exacerbate the effect of data imbalance on the model. For these reasons, we propose a novel scene graph generation model that effectively improves the detection of low-frequency relations. We use memory features to transfer high-frequency relation features to low-frequency relation features. Extensive experiments on scene graph datasets show that our model significantly improves performance on two evaluation metrics, R@K and mR@K, compared with state-of-the-art baselines.
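To make the two metrics named in the abstract concrete, the following is a minimal sketch of how R@K and mR@K are commonly computed, and why mR@K is the more sensitive one under a long-tail distribution. It is not the authors' evaluation code. It assumes a single image, exact matching of (subject, predicate, object) triples with no bounding-box overlap test, and hypothetical function names `recall_at_k` and `mean_recall_at_k`; the toy triples are invented for illustration.

```python
from collections import defaultdict


def recall_at_k(gt_triples, ranked_preds, k):
    """Fraction of ground-truth triples found among the top-k predictions."""
    top_k = set(ranked_preds[:k])
    hits = sum(1 for t in gt_triples if t in top_k)
    return hits / len(gt_triples)


def mean_recall_at_k(gt_triples, ranked_preds, k):
    """Recall computed separately per predicate, then averaged over predicates."""
    top_k = set(ranked_preds[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subj, pred, obj in gt_triples:
        totals[pred] += 1
        if (subj, pred, obj) in top_k:
            hits[pred] += 1
    per_predicate = [hits[p] / totals[p] for p in totals]
    return sum(per_predicate) / len(per_predicate)


# Toy ground truth (assumed data): "on" is frequent, "riding" is a tail relation.
gt = [
    ("man", "on", "horse"),
    ("hat", "on", "man"),
    ("cup", "on", "table"),
    ("man", "riding", "horse"),
]

# Toy ranked predictions: every "on" triple is found, the "riding" triple is missed.
preds = [
    ("man", "on", "horse"),
    ("hat", "on", "man"),
    ("cup", "on", "table"),
    ("man", "near", "horse"),
]

r = recall_at_k(gt, preds, k=4)        # 3 of 4 triples recovered
mr = mean_recall_at_k(gt, preds, k=4)  # "on": 3/3, "riding": 0/1
print(f"R@4 = {r:.2f}, mR@4 = {mr:.2f}")
```

In this toy case R@4 is 0.75 while mR@4 is 0.50: missing the single tail relation barely moves R@K but halves mR@K, which is why gains on low-frequency relations show up mainly in mR@K. Published benchmarks usually also require box overlap with the ground truth and average over many images, which this sketch omits.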

Supplementary Material

MP4 File (3394171.3413507.mp4)




Index Terms

Memory-Based Network for Scene Graph with Unbalanced Relations
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations
      2. Computer vision tasks
        Scene understanding


Information & Contributors

Information

Published In


MM '20: Proceedings of the 28th ACM International Conference on Multimedia

October 2020

4889 pages

ISBN:9781450379885

DOI:10.1145/3394171

General Chairs:
Chang Wen Chen (Chinese University of Hong Kong, Shenzhen, China)
Rita Cucchiara (UNIMORE, Italy)
Xian-Sheng Hua (Alibaba Group, China)

Program Chairs:
Guo-Jun Qi (Futurewei Technologies, USA)
Elisa Ricci (UNITN & Fondazione Bruno Kessler, Italy)
Zhengyou Zhang (Tencent, China)
Roger Zimmermann (National University of Singapore, Singapore)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

MM '20: The 28th ACM International Conference on Multimedia

October 12 - 16, 2020

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


