Transformer networks with adaptive inference for scene graph generation

Abstract

Understanding a visual scene requires not only identifying individual objects in isolation but also inferring the relationships and interactions between object pairs. In this study, we propose a novel Transformer-based scene graph generation framework that converts image data into linguistic descriptions characterized as the nodes and edges of a graph describing the <subject–predicate–object> information of a given image. The proposed model consists of three components. First, we propose an enhanced object detection module with a bidirectional long short-term memory (Bi-LSTM) network for object-to-object information exchange, which generates classification probabilities for object bounding boxes and classes. Second, we introduce a novel context-capture module built from Transformer layers that outputs context-aware object categories as well as context-aware edge information for specific object pairs. Finally, because relationship frequencies follow a long-tailed distribution, we design an adaptive inference module with a dedicated feature fusion strategy that softens this distribution and performs adaptive reasoning about relationship classification based on the visual appearance of each object pair. We conducted detailed experiments on three popular open-source datasets, namely Visual Genome, OpenImages, and Visual Relationship Detection, and performed ablation studies on each module, demonstrating significant improvements under different settings and across various metrics.
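To make the three-component design concrete, the snippet below gives a minimal PyTorch sketch of such a pipeline: a Bi-LSTM for object-to-object context, Transformer encoder layers for edge context, and a fusion step for relationship classification. All module names, layer sizes, and the concatenation-based fusion are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal, illustrative sketch of a three-stage scene graph pipeline.
# Every dimension and design choice here is a placeholder assumption.
import torch
import torch.nn as nn

class SceneGraphSketch(nn.Module):
    def __init__(self, num_obj_classes=151, num_rel_classes=51, d_model=512):
        super().__init__()
        # Stage 1: Bi-LSTM exchanges information between object proposals;
        # hidden size d_model // 2 so the bidirectional output is d_model.
        self.obj_context = nn.LSTM(d_model, d_model // 2,
                                   batch_first=True, bidirectional=True)
        self.obj_classifier = nn.Linear(d_model, num_obj_classes)
        # Stage 2: Transformer layers produce context-aware object/edge features.
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                                   batch_first=True)
        self.edge_context = nn.TransformerEncoder(encoder_layer, num_layers=3)
        # Stage 3: fuse subject, object, and union-box features, then
        # classify the predicate for each candidate pair.
        self.fuse = nn.Linear(3 * d_model, d_model)
        self.rel_classifier = nn.Linear(d_model, num_rel_classes)

    def forward(self, roi_feats, pair_idx, union_feats):
        # roi_feats:   (N, d_model) detector features for N proposals.
        # pair_idx:    (P, 2) subject/object indices for P candidate pairs.
        # union_feats: (P, d_model) features of each pair's union bounding box.
        ctx, _ = self.obj_context(roi_feats.unsqueeze(0))     # (1, N, d_model)
        ctx = ctx.squeeze(0)
        obj_logits = self.obj_classifier(ctx)                 # object classes
        edge = self.edge_context(ctx.unsqueeze(0)).squeeze(0)
        subj, obj = edge[pair_idx[:, 0]], edge[pair_idx[:, 1]]
        fused = torch.relu(self.fuse(torch.cat([subj, obj, union_feats], dim=-1)))
        # A class-frequency prior (e.g., log of empirical predicate frequencies)
        # could be added to these logits to soften the long-tailed distribution.
        rel_logits = self.rel_classifier(fused)               # predicate classes
        return obj_logits, rel_logits

# Example: 12 proposals and 20 candidate pairs from one image.
model = SceneGraphSketch()
obj_logits, rel_logits = model(torch.randn(12, 512),
                               torch.randint(0, 12, (20, 2)),
                               torch.randn(20, 512))
```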

Acknowledgements

This work was sponsored by the National Natural Science Foundation of China (No. 61802253).

Author information

Corresponding author

Correspondence to Yongbin Gao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, Y., Gao, Y., Yu, W. et al. Transformer networks with adaptive inference for scene graph generation. Appl Intell 53, 9621–9633 (2023). https://doi.org/10.1007/s10489-022-04022-0
