STDG: Semi-Teacher-Student Training Paradigm for Depth-guided One-stage Scene Graph Generation

Authors:

Zhaoxin Fan

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval

Pages 915 - 924

https://doi.org/10.1145/3652583.3658024

Published: 07 June 2024

Abstract

Scene Graph Generation is a critical enabler of environmental comprehension for autonomous robotic systems. Most existing methods, however, are thwarted by background complexity, which limits their ability to fully decode the inherent topological information of the environment. In addition, the contextual information carried by depth cues is often left untapped, which makes existing approaches less effective. To address these shortcomings, we present STDG, a Depth-Guided One-Stage Scene Graph Generation method. STDG consists of three custom-built modules: the Depth Guided HHA Representation Generation Module, the Depth Guided Semi-Teaching Network Learning Module, and the Depth Guided Scene Graph Generation Module. Together, these modules exploit depth information at every stage, from depth signal generation and depth feature utilization to the final scene graph prediction. Importantly, this is achieved without imposing additional computational burden during the inference phase. Experimental results confirm that our method significantly enhances the performance of one-stage scene graph generation baselines.
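
The abstract does not spell out how depth guidance can be confined to training. As a rough illustration only, the sketch below shows one generic arrangement (feature-level distillation from a depth-derived branch into an RGB-only student) that would leave inference cost unchanged. It is not the STDG implementation: the names `training_loss`, `infer`, and `distill_weight`, the mean-squared-error guidance term, and the toy feature vectors are all assumptions made for this example.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_loss(
    task_loss: float,
    student_feat: Sequence[float],
    teacher_feat: Optional[Sequence[float]],
    distill_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Task loss plus an optional depth-guidance term (assumed form).

    `teacher_feat` stands in for features computed from a depth-derived
    input such as an HHA encoding. It exists only at training time.
    """
    if teacher_feat is None:
        return task_loss
    return task_loss + distill_weight * mse(student_feat, teacher_feat)


def infer(student: Callable[[Vector], Vector], rgb: Vector) -> Vector:
    """Inference calls only the RGB student, so depth adds no runtime cost."""
    return student(rgb)


if __name__ == "__main__":
    # Toy feature vectors, purely illustrative.
    student_feat = [0.2, 0.4, 0.6]
    teacher_feat = [0.0, 0.4, 1.0]
    print(training_loss(1.0, student_feat, teacher_feat))
    print(infer(lambda x: [2 * v for v in x], [1.0, 2.0]))
```

The point of the sketch is the asymmetry: the depth-derived features enter only through `training_loss`, while `infer` never receives them, which is one way the "no additional computational burden during inference" property could be obtained.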

Index Terms

STDG: Semi-Teacher-Student Training Paradigm for Depth-guided One-stage Scene Graph Generation
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Object identification
      2. Computer vision tasks
        Scene understanding

Published In

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval

May 2024

1379 pages

ISBN: 9798400706196

DOI: 10.1145/3652583

General Chairs: Cathal Gurrin (Dublin City University, Ireland), Rachada Kongkachandra (Thammasat University, Thailand), Klaus Schoeffmann (Klagenfurt University, Austria)

Program Chairs: Duc-Tien Dang-Nguyen (University of Bergen, Norway), Luca Rossetto (University of Zurich, Switzerland), Shin'ichi Satoh (National Institute of Informatics, Japan), Liting Zhou (Dublin City University, Ireland)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China (NSFC)

Conference

ICMR '24

ICMR '24: International Conference on Multimedia Retrieval

June 10 - 14, 2024

Phuket, Thailand

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
