
Image Generation from Hyper Scene Graph with Multiple Types of Trinomial Hyperedges

  • Original Research
  • Published in SN Computer Science

Abstract

Generating realistic images is one of the important problems in the field of computer vision. Generating images that are consistent with a user's input is called conditional image generation. Owing to recent advances in generating high-quality images with Generative Adversarial Networks (GANs), many conditional image generation models have been proposed, such as text-to-image, scene-graph-to-image, and layout-to-image models. Among them, scene-graph-to-image models have the advantage of generating an image of a complex situation according to the structure of a scene graph. However, existing scene-graph-to-image models have difficulty capturing positional relations among three or more objects, since a binomial edge in a scene graph can only represent a relation between two objects. In this paper, we propose a novel image generation model, hsg2im, which addresses this shortcoming by generating images from a hyper scene graph with trinomial edges. We validate the effectiveness of hsg2im on multiple types of trinomial hyperedges. In addition, we introduce loss functions on the relative positions of objects to improve the accuracy of the generated bounding boxes. Experimental validations on the COCO-Stuff and Visual Genome datasets show that the proposed model generates images more consistent with users' inputs than our previous model.
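To make the distinction between binomial edges and trinomial hyperedges concrete, the following is a minimal sketch of how a hyper scene graph could be represented in Python. The class, field, and predicate names (including the "between" relation) are our illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class HyperSceneGraph:
    # Object category labels, e.g. "sky", indexed by position.
    objects: List[str]
    # Binomial edges (subject index, predicate, object index), as in an
    # ordinary scene graph: each edge relates exactly two objects.
    edges: List[Tuple[int, str, int]] = field(default_factory=list)
    # Trinomial hyperedges: one predicate relating three objects at once.
    hyperedges: List[Tuple[str, Tuple[int, int, int]]] = field(default_factory=list)

# A positional relation among three objects that binomial edges cannot
# express directly ("between" is a hypothetical predicate name):
g = HyperSceneGraph(
    objects=["sky", "boat", "sea"],
    edges=[(0, "above", 2)],              # sky above sea
    hyperedges=[("between", (1, 0, 2))],  # boat between sky and sea
)
```

A set of binomial edges such as (boat, below, sky) and (boat, above, sea) constrains each pair separately, whereas a single trinomial hyperedge constrains all three objects jointly.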


Data availability

The source code for creating additional trinomial hyperedges for the COCO and VG datasets will be provided upon request, for academic research purposes only.


Author information

Corresponding author

Correspondence to Ryosuke Miyake.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Recent Trends on Computer Vision, Imaging and Computer Graphics Theory and Applications” guest edited by Kadi Bouatouch, A. Augusto Sousa, Thomas Bashford-Rogers, Mounia Ziat and Helen Purchase.

Appendix A Structure of MLPs in Hyper Graph Convolutional Network

We show the detailed structure of the MLPs in hsg2im. Figures 16, 17, and 18 show net1, net2, and net3, respectively, which are used in the (hyper) graph convolutional network.

Fig. 16

Structure of net1. It receives two \(128\)-dimensional object vectors and one relation vector corresponding to \(e \in E\), and outputs two \(512\)-dimensional object vectors and one \(128\)-dimensional relation vector.

Fig. 17

Structure of net2. It receives one \(512\)-dimensional object vector, which is the output of net1 and net3, and outputs a \(128\)-dimensional object vector by conducting dimensionality reduction.

Fig. 18

Structure of net3. It receives three \(128\)-dimensional object vectors and one relation vector corresponding to \(q \in Q\), and outputs three \(512\)-dimensional object vectors and one \(128\)-dimensional relation vector.
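As a concrete reading of the dimensionalities in these captions, the sketch below builds the three MLPs in PyTorch. Only the stated input and output widths come from the captions; the \(128\)-dimensional relation vectors, the two-layer depth, the hidden width, and the mean pooling before net2 are our assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

D_OBJ, D_REL = 128, 128  # object width from the captions; 128-d relations are an assumption
D_OUT = 512              # candidate object width from the captions
D_HID = 512              # hidden width: an illustrative choice

def mlp(d_in: int, d_out: int) -> nn.Sequential:
    # Two-layer MLP; depth, hidden width, and ReLU are illustrative choices.
    return nn.Sequential(nn.Linear(d_in, D_HID), nn.ReLU(), nn.Linear(D_HID, d_out))

# net1 (Fig. 16): binomial edge e in E.
net1 = mlp(2 * D_OBJ + D_REL, 2 * D_OUT + D_REL)
# net2 (Fig. 17): reduces a pooled 512-d object vector back to 128 dimensions.
net2 = mlp(D_OUT, D_OBJ)
# net3 (Fig. 18): trinomial hyperedge q in Q.
net3 = mlp(3 * D_OBJ + D_REL, 3 * D_OUT + D_REL)

# One convolution step over a single edge and a single hyperedge.
o = torch.randn(3, D_OBJ)                         # embeddings of objects o0, o1, o2
r_e, r_q = torch.randn(D_REL), torch.randn(D_REL)

# Binomial edge (o0, o2): two 512-d candidates plus an updated relation vector.
v0e, v2e, r_e = net1(torch.cat([o[0], o[2], r_e])).split([D_OUT, D_OUT, D_REL])
# Trinomial hyperedge (o0, o1, o2): three 512-d candidates plus an updated relation.
v0q, v1q, v2q, r_q = net3(torch.cat([o[0], o[1], o[2], r_q])).split(
    [D_OUT, D_OUT, D_OUT, D_REL])

# Each object pools its 512-d candidates (mean pooling is an assumption)
# and net2 projects the result back to 128 dimensions.
o0_new = net2(torch.stack([v0q, v0e]).mean(dim=0))
print(o0_new.shape)  # torch.Size([128])
```

Under these assumptions, net1 and net3 emit per-object candidate vectors for every edge and hyperedge incident to an object, and net2 compresses the pooled candidates back to the working width, so repeated steps propagate information across the hyper scene graph.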

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Miyake, R., Matsukawa, T. & Suzuki, E. Image Generation from Hyper Scene Graph with Multiple Types of Trinomial Hyperedges. SN COMPUT. SCI. 5, 624 (2024). https://doi.org/10.1007/s42979-024-02791-8

