Multimodal graph inference network for scene graph generation

Duan, Jingwen; Min, Weidong; Lin, Deyu; Xu, Jianfeng; Xiong, Xin

doi:10.1007/s10489-021-02304-7

Multimodal graph inference network for scene graph generation

Published: 13 April 2021

Volume 51, pages 8768–8783, (2021)
Cite this article

Applied Intelligence Aims and scope Submit manuscript

Jingwen Duan¹,
Weidong Min ORCID: orcid.org/0000-0003-2526-2181^2,3,
Deyu Lin²,
Jianfeng Xu² &
…
Xin Xiong¹

952 Accesses
1 Citation
1 Altmetric
Explore all metrics

Abstract

A scene graph can describe images concisely and structurally. However, existing methods of scene graph generation have low capabilities of inferring certain relationships, because of the lack of semantic information and their heavy dependence on the statistical distribution of the training set. To alleviate the above problems, a Multimodal Graph Inference Network (MGIN), which includes two modules; Multimodal Information Extraction (MIE) and Target with Multimodal Feature Inference (TMFI), is proposed in this study. MGIN can increase the inference capability of triplets, especially for uncommon samples. In the proposed MIE module, the prior statistical knowledge of the training set is incorporated into the network in a reprocess to relieve the problem of overfitting to the training set. Visual and semantic features are extracted in the MIE module and fused as unified multimodal features in the TMFI module. These features are efficient for the inference module to increase the prediction capability of MGIN, especially for some uncommon samples. The proposed method achieves 27.0% average mean recall and 55.9% average recall, with improvements of 0.48% and 0.50%, respectively, compared with state-of-the-art methods. It also increases the average recall of 20 relationships with the lowest probability by 4.91%.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation

Transformer networks with adaptive inference for scene graph generation

Article 10 August 2022

Scene Graph Generation Based on Node-Relation Context Module

References

Johnson J, Krishna R, Stark M, Li LJ, Shamma D, Bernstein M, Fei-Fei L (2015) Image retrieval using scene graphs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3668–3678
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2117–2125
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Xiong X, Min W, Zheng W, Liao P, Yang H, Wang S (2020) S3d-cnn: skeleton-based 3d consecutive-low-pooling neural network for fall detection. Applied Intelligence
Qiu J, Dong Y, Ma H, Li J, Wang K, Tang J (2018) Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In: Proceedings of the eleventh ACM international conference on web search and data mining, 459–467
Cui P, Wang X, Pei J, Zhu W (2018) A survey on network embedding. IEEE Trans Knowl Data Eng 31(5):833–852
Article Google Scholar
Cai H, Zheng VW, Chang KCC (2018) A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Trans Knowl Data Eng 30(9):1616–1637
Article Google Scholar
Li Y, Tarlow D, Brockschmidt M, Zemel R (2015) Gated graph sequence neural networks. arXiv:151105493
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2008) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80
Article Google Scholar
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2008) Computational capabilities of graph neural networks. IEEE Trans Neural Netw 20(1):81–102
Article Google Scholar
Li X, Jiang S (2019) Know more say less: Image captioning based on scene graphs. IEEE Trans Multimed 21(8):2117–2130
Article Google Scholar
Xu N, Liu A-A, Liu J, Nie W, Su Y (2019) Scene graph captioner: Image captioning based on structural visual representation. J Vis Commun Image Represent 58:477–485
Article Google Scholar
Yang X, Tang K, Zhang H, Cai J (2019) Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 10685–10694
Xi Y, Zhang Y, Ding S, Wan S (2020) Visual question answering model based on visual relationship detection. Signal Process Image Commun 80:115648
Article Google Scholar
Prabhu N, Venkatesh Babu R (2015) Attribute-graph: A graph based approach to image ranking. In: Proceedings of the IEEE international conference on computer vision, pp 1071–1079
Herzig R, Bar A, Xu H, Chechik G, Darrell T, Globerson A (2019) Learning canonical representations for scene graph to image generation. arXiv:191207414
Chen T, Yu W, Chen R, Lin L (2019) Knowledge-embedded routing network for scene graph generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6163–6171
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li LJ, Shamma DA et al (2017) Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73
Article MathSciNet Google Scholar
Zhang H, Kyaw Z, Chang S-F, Chua T-S (2017) Visual translation embedding network for visual relation detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5532–5540
Wan H, Luo Y, Peng B, Zheng W-S (2018) Representation learning for scene graph completion via jointly structural and visual embedding. In: IJCAI, pp 949–956
Hung Z-S, Mallya A, Lazebnik S (2020) Contextual translation embedding for visual relationship detection and scene graph generation. IEEE Transactions on Pattern Analysis and Machine Intelligence
Dai B, Zhang Y, Lin D (2017) Detecting visual relationships with deep relational networks. In: Proceedings of the IEEE conference on computer vision and Pattern recognition, pp 3076–3086
Xu D, Zhu Y, Choy CB, Fei-Fei L (2017) Scene graph generation by iterative message passing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5410–5419
Li Y, Ouyang W, Zhou B, Wang K, Wang X (2017) Scene graph generation from objects, phrases and region captions. In: Proceedings of the IEEE international conference on computer vision, pp 1261–1270
Lu C, Krishna R, Bernstein M, Fei-Fei L (2016) Visual relationship detection with language priors. In: European conference on computer vision, Springer, pp 852–869
Zellers R, Yatskar M, Thomson S, Choi Y (2018) Neural motifs: Scene graph parsing with global context. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5831–5840
Gu J, Zhao H, Lin Z, Li S, Cai J, Ling M (2019) Scene graph generation with external knowledge and image reconstruction. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1969–1978
Li Y, Ouyang W, Zhou B, Shi J, Zhang C, Wang X (2018) Factorizable net: an efficient subgraph-based framework for scene graph generation. In: Proceedings of the european conference on computer vision (ECCV), pp 335–351
Yang J, Lu J, Lee S, Batra D, Parikh D (2018) Graph r-cnn for scene graph generation. In: Proceedings of the European conference on computer vision (ECCV), pp 670–685
Deng C, Yang X, Nie F, Tao D (2019) Saliency detection via a multiple self-weighted graph-based manifold ranking. IEEE Trans Multimed 22(4):885–896
Article Google Scholar
Li C, Tang H, Deng C, Zhan L, Liu W (2020) Vulnerability vs. reliability: Disentangled adversarial examples for cross-modal learning. In: Proceedings of the 26th ACM SIGKDD International conference on knowledge discovery & data mining, association for computing machinery, pp 421–429
Guo W, Cai J, Wang S (2020) Unsupervised discriminative feature representation via adversarial auto-encoder. Appl Intell 50(4):1155–1171
Article Google Scholar
Guo W, Wang J, Wang S (2019) Deep multimodal representation learning: A survey. IEEE Access 7:63373–63394
Article Google Scholar
Aytar Y, Castrejon L, Vondrick C, Pirsiavash H, Torralba A (2017) Cross-modal scene networks. IEEE Trans Pattern Anal Mach Intell 40(10):2303–2314
Article Google Scholar
Xie D, Deng C, Li C, Liu X, Tao D (2020) Multi-task consistency-preserving adversarial hashing for cross-modal retrieval. IEEE Trans Image Process 29:3626–3637
Article Google Scholar
Yang E, Deng C, Li C, Liu W, Li J, Tao D (2018) Shared predictive cross-modal deep quantization. IEEE Trans Neural Netw Learn Syst 29(11):5292–5303
Article Google Scholar
Wang S, Zhang H, Wang H (2017) Object co-segmentation via weakly supervised data fusion. Comput Vis Image Underst 155:43–54
Article Google Scholar
Jiang YG, Wu Z, Wang J, Xue X, Chang SF (2017) Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE Trans Pattern Anal Mach Intell 40 (2):352–364
Article Google Scholar
Vo AD, Nguyen QP, Ock CY (2020) Semantic and syntactic analysis in learning representation based on a sentiment analysis model. Appl Intell 50(3):663–680
Article Google Scholar
Nickel M, Murphy K, Tresp V, Gabrilovich E (2015) A review of relational machine learning for knowledge graphs. Proc IEEE 104(1):11–33
Article Google Scholar
Marino K, Salakhutdinov R, Gupta A (2016) The more you know: Using knowledge graphs for image classification. arXiv:161204844
Lee CW, Fang W, Yeh CK, Frank Wang YC (2018) Multi-label zero-shot learning with structured knowledge graphs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1576–1585
Yang X, Deng C, Liu T, Tao D (2020) Heterogeneous graph attention network for unsupervised multiple-target domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence
Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv:160902907
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7263–7271
Tang K, Niu Y, Huang J, Shi J, Zhang H (2020) Unbiased scene graph generation from biased training. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3716–3725
Chen V, Varma P, Krishna R, Bernstein M, Re C, Fei-Fei L (2019) Scene graph prediction with limited labels. In: International conference on computer vision
Newell A, Deng J (2017) Pixels to graphs by associative embedding. In: Advances in neural information processing systems, pp 2171–2180
Khademi M, Schulte O (2018) Dynamic gated graph neural networks for scene graph generation. In: Asian conference on computer vision, Springer, pp 669–685

Download references

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62076117, 61762061, 61763031), the Natural Science Foundation of Jiangxi Province, China (Grant No. 20161ACB20004) and Jiangxi Key Laboratory of Smart City (Grant No. 20192BCD40002).

Author information

Authors and Affiliations

School of Information Engineering, Nanchang University, Nanchang, 330031, China
Jingwen Duan & Xin Xiong
School of Software, Nanchang University, Nanchang, 330047, China
Weidong Min, Deyu Lin & Jianfeng Xu
Jiangxi Key Laboratory of Smart City, Nanchang, 330047, China
Weidong Min

Authors

Jingwen Duan
View author publications
You can also search for this author in PubMed Google Scholar
Weidong Min
View author publications
You can also search for this author in PubMed Google Scholar
Deyu Lin
View author publications
You can also search for this author in PubMed Google Scholar
Jianfeng Xu
View author publications
You can also search for this author in PubMed Google Scholar
Xin Xiong
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Weidong Min.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Duan, J., Min, W., Lin, D. et al. Multimodal graph inference network for scene graph generation. Appl Intell 51, 8768–8783 (2021). https://doi.org/10.1007/s10489-021-02304-7

Download citation

Accepted: 02 March 2021
Published: 13 April 2021
Issue Date: December 2021
DOI: https://doi.org/10.1007/s10489-021-02304-7

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Multimodal graph inference network for scene graph generation

Abstract

Access this article

Similar content being viewed by others

SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation

Transformer networks with adaptive inference for scene graph generation

Scene Graph Generation Based on Node-Relation Context Module

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Multimodal graph inference network for scene graph generation

Abstract

Access this article

Similar content being viewed by others

SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation

Transformer networks with adaptive inference for scene graph generation

Scene Graph Generation Based on Node-Relation Context Module

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation