DOI: 10.1145/3444685.3446312

Relationship graph learning network for visual relationship detection

Published: 03 May 2021

Abstract

Visual relationship detection aims to predict the relationships between detected object pairs. It is widely believed that the correlations between image components (i.e., objects and the relationships between them) are significant considerations when predicting objects' relationships. However, most current visual relationship detection methods exploit only the correlations among objects, leaving the correlations among objects' relationships underexplored. This paper proposes a relationship graph learning network (RGLN) to explore the correlations among objects' relationships for visual relationship detection. Specifically, RGLN obtains image objects using an object detector, and every pair of objects then constitutes a relationship proposal. All relationship proposals construct a relationship graph, in which the proposals are treated as nodes. Accordingly, RGLN designs bi-stream graph attention subnetworks to detect relationship proposals: one graph attention subnetwork analyzes correlations among relationships based on visual and spatial information, and the other analyzes correlations based on semantic and spatial information. In addition, RGLN exploits a relationship selection subnetwork to ignore redundant information from object pairs with no relationships. We conduct extensive experiments on two public datasets, VRD and Visual Genome (VG). The experimental results, compared with the state of the art, demonstrate the competitiveness of RGLN.
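The abstract describes the pipeline only at a high level. The following is a minimal sketch of that structure, not the authors' released implementation: relationship proposals (object pairs) form the nodes of a fully connected graph, two graph-attention streams process visual+spatial and semantic+spatial features, and a selection subnetwork scores whether a pair has any relationship. All module names, feature dimensions, the single-head attention, and the fusion scheme are assumptions made for illustration.

```python
# Hypothetical sketch of the RGLN structure described in the abstract.
# NOT the authors' code; dimensions, fusion, and heads are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """Single-head graph attention over a fully connected relationship graph."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x):                       # x: (N, in_dim), N = #proposals
        h = self.proj(x)                        # (N, out_dim)
        n = h.size(0)
        # Pairwise attention logits between every pair of relationship proposals.
        hi = h.unsqueeze(1).expand(n, n, -1)
        hj = h.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        alpha = torch.softmax(e, dim=-1)        # (N, N) attention over neighbors
        return F.elu(alpha @ h)                 # aggregated node features


class RGLNSketch(nn.Module):
    """Bi-stream graph attention over relationship proposals (assumed dims)."""

    def __init__(self, vis_dim=1024, sem_dim=600, spa_dim=64,
                 hid_dim=512, num_predicates=70):
        super().__init__()
        # Stream 1: visual + spatial features of each object pair.
        self.vis_stream = GraphAttentionLayer(vis_dim + spa_dim, hid_dim)
        # Stream 2: semantic (e.g., word-embedding) + spatial features.
        self.sem_stream = GraphAttentionLayer(sem_dim + spa_dim, hid_dim)
        # Relationship selection subnetwork: scores whether a pair has a relation.
        self.selector = nn.Sequential(nn.Linear(2 * hid_dim, hid_dim),
                                      nn.ReLU(),
                                      nn.Linear(hid_dim, 1))
        self.predicate_cls = nn.Linear(2 * hid_dim, num_predicates)

    def forward(self, vis_feat, sem_feat, spa_feat):
        # Each input is (N, dim), one row per relationship proposal (object pair).
        h_vis = self.vis_stream(torch.cat([vis_feat, spa_feat], dim=-1))
        h_sem = self.sem_stream(torch.cat([sem_feat, spa_feat], dim=-1))
        h = torch.cat([h_vis, h_sem], dim=-1)
        keep_score = torch.sigmoid(self.selector(h)).squeeze(-1)   # (N,)
        predicate_logits = self.predicate_cls(h)                   # (N, P)
        return keep_score, predicate_logits


if __name__ == "__main__":
    n = 12  # e.g., 4 detected objects -> 4 * 3 = 12 ordered pairs
    model = RGLNSketch()
    keep, logits = model(torch.randn(n, 1024), torch.randn(n, 600), torch.randn(n, 64))
    print(keep.shape, logits.shape)  # torch.Size([12]) torch.Size([12, 70])
```

In this reading, the selection score can be used to suppress pairs with no relationship before (or alongside) predicate classification; how the two streams and the selector are actually fused and trained is specified in the paper, not here.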


Cited By

  • (2024) CoBjeason: Reasoning Covered Object in Image by Multi-Agent Collaboration Based on Informed Knowledge Graph. ACM Transactions on Knowledge Discovery from Data 18, 5 (2024), 1-56. DOI: 10.1145/3643565. Online publication date: 28 Feb 2024.



    Published In

    MMAsia '20: Proceedings of the 2nd ACM International Conference on Multimedia in Asia
    March 2021
    512 pages
    ISBN:9781450383080
    DOI:10.1145/3444685


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 May 2021


    Author Tags

    1. bi-stream graph attention
    2. relationship graph
    3. visual relationship detection

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • National Key R&D Program of China

    Conference

    MMAsia '20: ACM Multimedia Asia
    March 7, 2021
    Virtual Event, Singapore

    Acceptance Rates

    Overall Acceptance Rate 59 of 204 submissions, 29%


    Article Metrics

    • Downloads (last 12 months): 12
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 17 Jan 2025

