DOI: 10.1145/3394171.3413507
research-article

Memory-Based Network for Scene Graph with Unbalanced Relations

Published: 12 October 2020

Abstract

A scene graph, which can be represented as a set of visual triples, is composed of objects and the relations between object pairs. It is vital for image captioning, visual question answering, and many other applications. However, scene graph datasets follow a long-tail distribution, and tail relations cannot be accurately identified because they lack training samples. Nonstandard labels and feature overlap in scene graphs further hinder the extraction of discriminative features and exacerbate the effect of data imbalance on the model. For these reasons, we propose a novel scene graph generation model that effectively improves the detection of low-frequency relations. We use memory features to transfer high-frequency relation features to low-frequency relation features. Extensive experiments on scene graph datasets show that our model significantly improves performance on two evaluation metrics, R@K and mR@K, compared with state-of-the-art baselines.
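To make the two metrics named in the abstract concrete, the following is a minimal sketch of how R@K and mR@K are commonly computed, and why they diverge on long-tailed data. It is not the authors' evaluation code. The function names and the toy triples are hypothetical, and the sketch scores a single pooled list of triples, whereas benchmark implementations typically compute recall per image and aggregate over the test set.

```python
from collections import defaultdict


def recall_at_k(gt_triples, ranked_predictions, k):
    """Fraction of ground-truth triples found among the top-k predictions."""
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for triple in gt_triples if triple in top_k)
    return hits / len(gt_triples)


def mean_recall_at_k(gt_triples, ranked_predictions, k):
    """Average of per-predicate recalls, so every relation class counts equally."""
    top_k = set(ranked_predictions[:k])
    hits = defaultdict(int)
    totals = defaultdict(int)
    for triple in gt_triples:
        predicate = triple[1]  # triples are (subject, predicate, object)
        totals[predicate] += 1
        hits[predicate] += int(triple in top_k)
    return sum(hits[p] / totals[p] for p in totals) / len(totals)


# Toy, made-up data: "on" is a frequent (head) relation, "riding" a rare (tail) one.
gt = [
    ("man", "on", "horse"),
    ("hat", "on", "man"),
    ("cup", "on", "table"),
    ("dog", "on", "grass"),
    ("man", "riding", "horse"),
]
# Hypothetical ranked model output that gets every "on" right but misses "riding".
pred = [
    ("man", "on", "horse"),
    ("hat", "on", "man"),
    ("cup", "on", "table"),
    ("dog", "on", "grass"),
    ("man", "near", "horse"),
]

r = recall_at_k(gt, pred, k=5)        # 4 of 5 triples recovered
mr = mean_recall_at_k(gt, pred, k=5)  # mean of recall("on") and recall("riding")
print(f"R@5 = {r:.2f}, mR@5 = {mr:.2f}")
```

In this toy case R@5 is 0.80 while mR@5 is 0.50: a model that only handles the head relation still scores well on R@K, whereas mR@K exposes the missed tail relation. This is why both metrics are reported when evaluating performance on unbalanced relations.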

Supplementary Material

MP4 File (3394171.3413507.mp4)

Published In

MM '20: Proceedings of the 28th ACM International Conference on Multimedia
October 2020
4889 pages
ISBN:9781450379885
DOI:10.1145/3394171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. memory feature
  2. neural networks
  3. scene graph

Qualifiers

  • Research-article

Conference

MM '20

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) Scene Graph Generation: A comprehensive survey. Neurocomputing 566 (127052). DOI: 10.1016/j.neucom.2023.127052. Online publication date: Jan-2024
  • (2023) CPG3D: Cross-Modal Priors Guided 3D Object Reconstruction. IEEE Transactions on Multimedia 25 (9383-9396). DOI: 10.1109/TMM.2023.3251697. Online publication date: 1-Jan-2023
  • (2023) Knowing What it is: Semantic-Enhanced Dual Attention Transformer. IEEE Transactions on Multimedia 25 (3723-3736). DOI: 10.1109/TMM.2022.3164787. Online publication date: 1-Jan-2023
  • (2023) Scene Graph Generation using Depth-based Multimodal Network. 2023 IEEE International Conference on Multimedia and Expo (ICME) (1139-1144). DOI: 10.1109/ICME55011.2023.00199. Online publication date: Jul-2023
  • (2022) Real-Time Detection of Full-Scale Forest Fire Smoke Based on Deep Convolution Neural Network. Remote Sensing 14:3 (536). DOI: 10.3390/rs14030536. Online publication date: 23-Jan-2022
  • (2022) Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation. Proceedings of the 30th ACM International Conference on Multimedia (4204-4213). DOI: 10.1145/3503161.3548164. Online publication date: 10-Oct-2022
  • (2022) Multi-Modal Knowledge Graph Construction and Application: A Survey. IEEE Transactions on Knowledge and Data Engineering (1-20). DOI: 10.1109/TKDE.2022.3224228. Online publication date: 2022
  • (2022) Complete interest propagation from part for visual relation of interest detection. International Journal of Machine Learning and Cybernetics 14:2 (455-465). DOI: 10.1007/s13042-022-01603-w. Online publication date: 2-Aug-2022
