DOI: 10.1145/3343031.3350962

Visual Relation Detection with Multi-Level Attention

Published: 15 October 2019

Abstract

Visual relations, which describe various types of interactions between two objects in an image, provide critical information for comprehensive semantic understanding of the image. Multiple cues related to the objects can contribute to visual relation detection, chiefly appearance, spatial location, and semantic meaning, so it is important to represent these different cues and combine them effectively. In previous works, however, the appearance representation is simply a global visual representation based on the bounding boxes of the objects, which may fail to capture the salient regions of the interaction between the two objects, and the different cue representations are concatenated with equal weight, without considering their differing contributions to different relations. In this work, we propose a multi-level attention visual relation detection model (MLA-VRD), which generates a salient appearance representation via a multi-stage appearance attention strategy and adaptively combines the different cues with learned importance weights via a multi-cue attention strategy. Extensive experimental results on two widely used visual relation detection datasets, VRD and Visual Genome, demonstrate the effectiveness of our proposed model, which significantly outperforms the previous state-of-the-art methods. Our model also achieves superior performance under the zero-shot learning condition, an important test of the generalization ability of visual relation detection models.
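To make the multi-cue attention idea concrete, the sketch below shows one way an adaptive weighting over appearance, spatial, and semantic cue vectors could be implemented. This is an illustrative approximation only: the module name, dimensions, projection layers, and softmax gating are our assumptions, not the authors' MLA-VRD implementation, whose exact formulation is given in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiCueAttentionFusion(nn.Module):
    """Hypothetical sketch of multi-cue attention: learns a softmax
    weighting over appearance, spatial, and semantic cue vectors for
    each subject-object pair. Not the authors' implementation."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Project each of the three cues into a shared space
        # (the dimensionality is an assumption).
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        # Produces one importance logit per cue from its projected vector.
        self.score = nn.Linear(dim, 1)

    def forward(self, appearance, spatial, semantic):
        # Each input: (batch, dim) cue representation for a candidate pair.
        cues = torch.stack(
            [p(c) for p, c in zip(self.proj, (appearance, spatial, semantic))],
            dim=1,  # (batch, 3, dim)
        )
        # Softmax over the cue axis yields per-relation importance weights.
        weights = F.softmax(self.score(torch.tanh(cues)), dim=1)  # (batch, 3, 1)
        # The weighted sum adaptively emphasizes the most informative cue.
        return (weights * cues).sum(dim=1)  # (batch, dim)

# Usage: fuse cue vectors for a batch of candidate object pairs,
# then feed the fused vector into a relation (predicate) classifier.
fusion = MultiCueAttentionFusion(dim=512)
a, s, m = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)
fused = fusion(a, s, m)
```

The softmax gate makes the cue weights sum to one per pair, so, for instance, a spatially defined predicate such as "above" can lean on the spatial cue while an appearance-defined one such as "wear" leans on the visual cue.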

Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. attention
  2. feature representation
  3. visual relation detection

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China award number(s):
  • Beijing Natural Science Foundation award number(s):
  • National Key Research and Development Plan award number(s)

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) UniQ: Unified Decoder with Task-specific Queries for Efficient Scene Graph Generation. Proceedings of the 32nd ACM International Conference on Multimedia, 8815-8824. DOI: 10.1145/3664647.3681542
  • (2024) Knowledge-Embedded Mutual Guidance for Visual Reasoning. IEEE Transactions on Cybernetics 54(4), 2579-2591. DOI: 10.1109/TCYB.2023.3310892
  • (2024) Scene Graph Generation: A comprehensive survey. Neurocomputing 566, 127052. DOI: 10.1016/j.neucom.2023.127052
  • (2024) The All-Seeing Project V2: Towards General Relation Comprehension of the Open World. Computer Vision - ECCV 2024, 471-490. DOI: 10.1007/978-3-031-73414-4_27
  • (2023) HiLo: Exploiting High Low Frequency Relations for Unbiased Panoptic Scene Graph Generation. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 21580-21591. DOI: 10.1109/ICCV51070.2023.01978
  • (2023) Image captioning based on scene graphs: A survey. Expert Systems with Applications 231, 120698. DOI: 10.1016/j.eswa.2023.120698
  • (2023) Bel: Batch Equalization Loss for scene graph generation. Pattern Analysis and Applications 26(4), 1821-1831. DOI: 10.1007/s10044-023-01199-z
  • (2023) Ontological Scene Graph Engineering and Reasoning Over YOLO Objects for Creating Panoramic VR Content. Multi-disciplinary Trends in Artificial Intelligence, 225-235. DOI: 10.1007/978-3-031-36402-0_20
  • (2022) Boosting Relationship Detection in Images with Multi-Granular Self-Supervised Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19(2s), 1-18. DOI: 10.1145/3556978
  • (2022) Attention Guided Relation Detection Approach for Video Visual Relation Detection. IEEE Transactions on Multimedia 24, 3896-3907. DOI: 10.1109/TMM.2021.3109430
