Transformer networks with adaptive inference for scene graph generation

Abstract

Understanding a visual scene requires not only identifying individual objects in isolation but also inferring the relationships and interactions between object pairs. In this study, we propose a novel Transformer-based scene graph generation framework that converts image data into linguistic descriptions characterized as the nodes and edges of a graph describing the <subject–predicate–object> information of a given image. The proposed model consists of three components. First, we propose an enhanced object detection module with a bidirectional long short-term memory (Bi-LSTM) network for object-to-object information exchange, which generates classification probabilities for object bounding boxes and classes. Second, we introduce a novel context-capture module built from Transformer layers that outputs context-aware object categories as well as context-aware edge information for specific object pairs. Finally, because relationship frequencies follow a long-tailed distribution, we design an adaptive inference module with a dedicated feature fusion strategy that softens this distribution and performs adaptive reasoning about relationship classification based on the visual appearance of each object pair. We conducted detailed experiments on three popular open-source datasets, namely Visual Genome, OpenImages, and Visual Relationship Detection, and performed ablation studies on each module, demonstrating significant improvements under different settings and across various metrics.
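To make the three-component design concrete, the snippet below gives a minimal PyTorch sketch of such a pipeline: a Bi-LSTM for object-to-object context, Transformer encoder layers for edge context, and a fusion step for relationship classification. All module names, layer sizes, and the concatenation-based fusion are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal, illustrative sketch of a three-stage scene graph pipeline.
# Every dimension and design choice here is a placeholder assumption.
import torch
import torch.nn as nn

class SceneGraphSketch(nn.Module):
    def __init__(self, num_obj_classes=151, num_rel_classes=51, d_model=512):
        super().__init__()
        # Stage 1: Bi-LSTM exchanges information between object proposals;
        # hidden size d_model // 2 so the bidirectional output is d_model.
        self.obj_context = nn.LSTM(d_model, d_model // 2,
                                   batch_first=True, bidirectional=True)
        self.obj_classifier = nn.Linear(d_model, num_obj_classes)
        # Stage 2: Transformer layers produce context-aware object/edge features.
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                                   batch_first=True)
        self.edge_context = nn.TransformerEncoder(encoder_layer, num_layers=3)
        # Stage 3: fuse subject, object, and union-box features, then
        # classify the predicate for each candidate pair.
        self.fuse = nn.Linear(3 * d_model, d_model)
        self.rel_classifier = nn.Linear(d_model, num_rel_classes)

    def forward(self, roi_feats, pair_idx, union_feats):
        # roi_feats:   (N, d_model) detector features for N proposals.
        # pair_idx:    (P, 2) subject/object indices for P candidate pairs.
        # union_feats: (P, d_model) features of each pair's union bounding box.
        ctx, _ = self.obj_context(roi_feats.unsqueeze(0))     # (1, N, d_model)
        ctx = ctx.squeeze(0)
        obj_logits = self.obj_classifier(ctx)                 # object classes
        edge = self.edge_context(ctx.unsqueeze(0)).squeeze(0)
        subj, obj = edge[pair_idx[:, 0]], edge[pair_idx[:, 1]]
        fused = torch.relu(self.fuse(torch.cat([subj, obj, union_feats], dim=-1)))
        # A class-frequency prior (e.g., log of empirical predicate frequencies)
        # could be added to these logits to soften the long-tailed distribution.
        rel_logits = self.rel_classifier(fused)               # predicate classes
        return obj_logits, rel_logits

# Example: 12 proposals and 20 candidate pairs from one image.
model = SceneGraphSketch()
obj_logits, rel_logits = model(torch.randn(12, 512),
                               torch.randint(0, 12, (20, 2)),
                               torch.randn(20, 512))
```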

Acknowledgements

This work was sponsored by the National Natural Science Foundation of China (No. 61802253).

Author information

Corresponding author

Correspondence to Yongbin Gao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Y., Gao, Y., Yu, W. et al. Transformer networks with adaptive inference for scene graph generation. Appl Intell 53, 9621–9633 (2023). https://doi.org/10.1007/s10489-022-04022-0
