DOI: 10.1145/3394171.3413566

Visual Relation of Interest Detection

Published: 12 October 2020

Abstract

In this paper, we propose a novel Visual Relation of Interest Detection (VROID) task, which aims to detect the visual relations that are important for conveying the main content of an image. The task is motivated by the intuition that not all correctly detected relations are really "interesting" in semantics: only a fraction of them are meaningful for representing the main content of the image. We name such relations Visual Relations of Interest (VROIs). VROID can be viewed as an evolution of the traditional Visual Relation Detection (VRD) task, which tries to discover all visual relations in an image. To facilitate research on this new task, we construct a new dataset, named ViROI, which contains 30,120 images, each annotated with VROIs. Furthermore, we develop an Interest Propagation Network (IPNet) to solve VROID. IPNet contains a Panoptic Object Detection (POD) module, a Pair Interest Prediction (PaIP) module, and a Predicate Interest Prediction (PrIP) module. The POD module extracts instances from the input image and generates the corresponding instance features and union features. The PaIP module then predicts the interest score of each instance pair, while the PrIP module predicts that of each predicate for each instance pair. The interest score of each instance pair is combined with those of the corresponding predicates to produce the final interest scores; all VROI candidates are sorted by final interest score, and the highest-ranked ones are taken as the final results. We conduct extensive experiments to test the effectiveness of our method, and the results show that IPNet achieves the best performance compared with baselines on visual relation detection, scene graph generation, and image captioning.
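
To make the scoring and ranking step concrete, the following is a minimal sketch in PyTorch. It is not the authors' implementation: the module internals, the feature dimensionality, the predicate vocabulary size, and the multiplicative combination of pair and predicate interest scores are all illustrative assumptions, and in IPNet the features would come from the POD module rather than from random tensors.

```python
# Minimal sketch of the IPNet-style scoring pipeline described above.
# All internals and shapes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

D = 256              # assumed feature dimensionality
NUM_PREDICATES = 50  # assumed predicate vocabulary size

class PairInterestPredictor(nn.Module):
    """Scores how interesting a (subject, object) instance pair is."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * D, D), nn.ReLU(), nn.Linear(D, 1))

    def forward(self, subj, obj, union):
        # Concatenate subject, object, and union-region features.
        x = torch.cat([subj, obj, union], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)   # (P,) pair interest

class PredicateInterestPredictor(nn.Module):
    """Scores each predicate for a given instance pair."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * D, D), nn.ReLU(),
                                 nn.Linear(D, NUM_PREDICATES))

    def forward(self, subj, obj, union):
        x = torch.cat([subj, obj, union], dim=-1)
        return torch.sigmoid(self.mlp(x))               # (P, R) predicate interest

def rank_vroi_candidates(subj, obj, union, top_k=10):
    """Combine pair and predicate interest scores and rank all candidates."""
    # Untrained modules with random weights, purely for illustration.
    pair_score = PairInterestPredictor()(subj, obj, union)       # (P,)
    pred_score = PredicateInterestPredictor()(subj, obj, union)  # (P, R)
    # Assumed combination rule: final score = pair interest * predicate interest.
    final = pair_score.unsqueeze(-1) * pred_score                # (P, R)
    scores, idx = final.flatten().topk(min(top_k, final.numel()))
    pairs = idx // NUM_PREDICATES   # which instance pair
    preds = idx % NUM_PREDICATES    # which predicate for that pair
    return list(zip(pairs.tolist(), preds.tolist(), scores.tolist()))

# Stand-in for POD-module outputs: P candidate pairs with random features.
P = 12
subj, obj, union = (torch.randn(P, D) for _ in range(3))
print(rank_vroi_candidates(subj, obj, union, top_k=5))
```

Factorizing the triplet score into a pair-level term and a per-predicate term means each candidate pair is scored once and the result is reused across all predicates, which keeps the ranking over subject-predicate-object triplets tractable.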

Supplementary Material

MP4 File (3394171.3413566.mp4)
We propose a new Visual Relation of Interest Detection task that aims to detect the visual relations that are important for conveying the main content of an image, motivated by the intuition that not all correctly detected relations are really "interesting" in semantics and only a fraction of them are meaningful for representing the main content of the image. To facilitate research on this new task, we construct a new dataset, named ViROI, which contains 30,120 images, each annotated with VROIs. Furthermore, we develop an Interest Propagation Network to solve Visual Relation of Interest Detection. It contains a Panoptic Object Detection module, a Pair Interest Prediction module, and a Predicate Interest Prediction module. We conduct extensive experiments to test the effectiveness of our method, and the results show that our Interest Propagation Network achieves the best performance compared with baselines on visual relation detection, scene graph generation, and image captioning.

    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN: 9781450379885
    DOI: 10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. interest estimation
    2. interest propagation network
    3. visual relation detection
    4. visual relation of interest

    Qualifiers

    • Research-article

    Funding Sources

    • Natural Science Foundation of Jiangsu Province
    • Collaborative Innovation Center of Novel Software Technology and Industrialization
    • National Science Foundation of China
    • Science, Technology and Innovation Commission of Shenzhen Municipality

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Cited By

    • (2024) 3D Scene Graph Generation From Point Clouds. IEEE Transactions on Multimedia 26, 5358-5368. https://doi.org/10.1109/TMM.2023.3331583
    • (2023) Fine-Grained Scene Graph Generation with Overlap Region and Geometrical Center. Computer Graphics Forum 41(7), 359-370. https://doi.org/10.1111/cgf.14683
    • (2023) A Balanced Relation Prediction Framework for Scene Graph Generation. Artificial Neural Networks and Machine Learning – ICANN 2023, 216-228. https://doi.org/10.1007/978-3-031-44216-2_18
    • (2022) Complete interest propagation from part for visual relation of interest detection. International Journal of Machine Learning and Cybernetics 14(2), 455-465. https://doi.org/10.1007/s13042-022-01603-w
    • (2021) Reproducibility Companion Paper. Proceedings of the 29th ACM International Conference on Multimedia, 3633-3637. https://doi.org/10.1145/3474085.3477940
    • (2021) Recovering the Unbiased Scene Graphs from the Biased Ones. Proceedings of the 29th ACM International Conference on Multimedia, 1581-1590. https://doi.org/10.1145/3474085.3475297
    • (2021) Topic Scene Graph Generation by Attention Distillation from Caption. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 15880-15890. https://doi.org/10.1109/ICCV48922.2021.01560
