DOI: 10.1145/3343031.3350962

Visual Relation Detection with Multi-Level Attention

Published: 15 October 2019

Abstract

Visual relations, which describe various types of interactions between two objects in an image, provide critical information for comprehensive semantic understanding of the image. Multiple cues related to the objects can contribute to visual relation detection, chiefly appearance, spatial location, and semantic meaning, so it is important to represent these different cues and combine them effectively. In previous works, however, the appearance representation is simply a global visual representation based on the bounding boxes of the objects, which may fail to capture the salient regions of the interaction between the two objects, and the different cue representations are concatenated with equal weight, without considering their differing contributions to different relations. In this work, we propose a multi-level attention visual relation detection model (MLA-VRD), which generates a salient appearance representation via a multi-stage appearance attention strategy and adaptively combines the different cues with learned importance weights via a multi-cue attention strategy. Extensive experimental results on two widely used visual relation detection datasets, VRD and Visual Genome, demonstrate the effectiveness of our proposed model, which significantly outperforms the previous state-of-the-art methods. Our model also achieves superior performance under the zero-shot learning condition, an important test of the generalization ability of visual relation detection models.
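To make the multi-cue attention idea concrete, the sketch below shows one way an adaptive weighting over appearance, spatial, and semantic cue vectors could be implemented. This is an illustrative approximation only: the module name, dimensions, projection layers, and softmax gating are our assumptions, not the authors' MLA-VRD implementation, whose exact formulation is given in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCueAttentionFusion(nn.Module):
    """Hypothetical sketch of multi-cue attention: learns a softmax
    weighting over appearance, spatial, and semantic cue vectors for
    each subject-object pair. Not the authors' implementation."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Project each of the three cues into a shared space
        # (the dimensionality is an assumption).
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        # Produces one importance logit per cue from its projected vector.
        self.score = nn.Linear(dim, 1)

    def forward(self, appearance, spatial, semantic):
        # Each input: (batch, dim) cue representation for a candidate pair.
        cues = torch.stack(
            [p(c) for p, c in zip(self.proj, (appearance, spatial, semantic))],
            dim=1,  # (batch, 3, dim)
        )
        # Softmax over the cue axis yields per-relation importance weights.
        weights = F.softmax(self.score(torch.tanh(cues)), dim=1)  # (batch, 3, 1)
        # The weighted sum adaptively emphasizes the most informative cue.
        return (weights * cues).sum(dim=1)  # (batch, dim)

# Usage: fuse cue vectors for a batch of candidate object pairs,
# then feed the fused vector into a relation (predicate) classifier.
fusion = MultiCueAttentionFusion(dim=512)
a, s, m = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
fused = fusion(a, s, m)
```

The softmax gate makes the cue weights sum to one per pair, so, for instance, a spatially defined predicate such as "above" can lean on the spatial cue while an appearance-defined one such as "wear" leans on the visual cue.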

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. attention
  2. feature representation
  3. visual relation detection

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China award number(s):
  • Beijing Natural Science Foundation award number(s):
  • National Key Research and Development Plan award number(s)

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation. Proceedings of the 32nd ACM International Conference on Multimedia, 8815-8824. DOI: 10.1145/3664647.3681542
  • (2024) Knowledge-Embedded Mutual Guidance for Visual Reasoning. IEEE Transactions on Cybernetics 54(4), 2579-2591. DOI: 10.1109/TCYB.2023.3310892
  • (2024) Scene Graph Generation: A comprehensive survey. Neurocomputing 566, 127052. DOI: 10.1016/j.neucom.2023.127052
  • (2024) The All-Seeing Project V2: Towards General Relation Comprehension of the Open World. Computer Vision - ECCV 2024, 471-490. DOI: 10.1007/978-3-031-73414-4_27
  • (2023) HiLo: Exploiting High Low Frequency Relations for Unbiased Panoptic Scene Graph Generation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 21580-21591. DOI: 10.1109/ICCV51070.2023.01978
  • (2023) Image captioning based on scene graphs: A survey. Expert Systems with Applications 231, 120698. DOI: 10.1016/j.eswa.2023.120698
  • (2023) Bel: Batch Equalization Loss for scene graph generation. Pattern Analysis and Applications 26(4), 1821-1831. DOI: 10.1007/s10044-023-01199-z
  • (2023) Ontological Scene Graph Engineering and Reasoning Over YOLO Objects for Creating Panoramic VR Content. Multi-disciplinary Trends in Artificial Intelligence, 225-235. DOI: 10.1007/978-3-031-36402-0_20
  • (2022) Boosting Relationship Detection in Images with Multi-Granular Self-Supervised Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19(2s), 1-18. DOI: 10.1145/3556978
  • (2022) Attention Guided Relation Detection Approach for Video Visual Relation Detection. IEEE Transactions on Multimedia 24, 3896-3907. DOI: 10.1109/TMM.2021.3109430
