DOI: 10.1145/3487075.3487158
Research article

Text Pared into Scene Graph for Diverse Image Generation

Published: 07 December 2021

Abstract

Although recent advances in conditional generative models have brought remarkable improvements in controlled image generation, generating images containing multiple complex objects remains a challenge. To address this challenge, we propose a module that parses a text description into a scene graph, from which a reasonable scene layout can be generated to keep both the overall image and the individual objects realistic. Our method strengthens the interaction between objects and global semantics by concatenating each object embedding with the text embedding. To preserve local image semantics, a spatially-adaptive normalization (SPADE) layer is added to the generator of our model. We validate our method on Visual Genome and COCO-Stuff, where qualitative results and an ablation study demonstrate our model's ability to generate images with multiple objects and complex relationships.
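The abstract's first step, parsing text into a scene graph, can be illustrated with a toy sketch. This is not the paper's actual parser (which is not reproduced here); it assumes a simplified grammar in which each semicolon-separated clause is a `subject relation object` triple, and it builds the graph as a list of object nodes plus relation edges between node ids:

```python
# Illustrative sketch only: assumes a toy "subject relation object" clause
# grammar, not the parsing module proposed in the paper.

def parse_text_to_scene_graph(text):
    """Parse semicolon-separated clauses into (objects, triples)."""
    objects = []   # unique object names; list index serves as node id
    index = {}     # object name -> node id
    triples = []   # (subject_id, relation, object_id) edges

    def node(name):
        # Reuse the existing node id if the object was already mentioned.
        if name not in index:
            index[name] = len(objects)
            objects.append(name)
        return index[name]

    for clause in text.split(";"):
        parts = clause.split()
        if len(parts) != 3:
            continue  # skip clauses that do not fit the toy grammar
        subj, rel, obj = parts
        triples.append((node(subj), rel, node(obj)))
    return objects, triples

objects, triples = parse_text_to_scene_graph(
    "sheep standing-on grass; tree behind sheep")
```

In such a representation, repeated mentions ("sheep" above) resolve to the same node, which is what lets a layout generator place each object once while honoring all of its relationships.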

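The SPADE layer mentioned in the abstract (Park et al., CVPR 2019) normalizes a feature map per channel and then modulates it elementwise with gamma and beta maps predicted from the semantic layout, so layout information is reinjected at every normalization step rather than being washed out. The sketch below shows the modulation arithmetic only; real SPADE predicts gamma and beta with learned convolutions, whereas here fixed random projections stand in for them:

```python
import numpy as np

# Hedged sketch of spatially-adaptive normalization: the learned conv
# layers of actual SPADE are replaced by fixed random 1x1 projections.
rng = np.random.default_rng(0)

def spade(x, seg, eps=1e-5):
    """x: features (C, H, W); seg: one-hot semantic layout (K, H, W)."""
    C, H, W = x.shape
    K = seg.shape[0]
    # Normalize each channel over its spatial positions.
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    # Stand-ins for the learned convs mapping layout -> (gamma, beta).
    w_gamma = rng.standard_normal((C, K))
    w_beta = rng.standard_normal((C, K))
    gamma = np.einsum("ck,khw->chw", w_gamma, seg)  # (C, H, W)
    beta = np.einsum("ck,khw->chw", w_beta, seg)    # (C, H, W)
    # Spatially varying affine modulation of the normalized features.
    return gamma * x_norm + beta

x = rng.standard_normal((4, 8, 8))
seg = np.zeros((3, 8, 8))
seg[0, :, :4] = 1   # left half labeled class 0
seg[1, :, 4:] = 1   # right half labeled class 1
y = spade(x, seg)
```

Because gamma and beta differ per pixel according to the layout, two regions with identical normalized features but different semantic labels receive different modulations, which is why the abstract credits this layer with preserving local image semantics.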
References

[1] Reed S, Akata Z, Yan X, et al. (2016). Generative adversarial text to image synthesis. In International Conference on Machine Learning (ICML), PMLR, 1060-1069.
[2] Johnson J, Gupta A, Fei-Fei L (2018). Image generation from scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1219-1228.
[3] Li Y, Ma T, Bai Y, et al. (2019). PasteGAN: A semi-parametric method to generate image from scene graph. Advances in Neural Information Processing Systems, 32: 3948-3958.
[4] Ashual O, Wolf L (2019). Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 4561-4569.
[5] Zhao B, Meng L, Yin W, et al. (2019). Image generation from layout. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8584-8593.
[6] Sun W, Wu T (2019). Image synthesis from reconfigurable layout and style. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 10531-10540.
[7] Sylvain T, Zhang P, Bengio Y, et al. (2020). Object-centric image generation from layouts. arXiv preprint arXiv:2003.07449.
[8] Tan F, Feng S, Ordonez V (2019). Text2Scene: Generating compositional scenes from textual descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6710-6719.
[9] Mikolov T, Sutskever I, Chen K, et al. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111-3119.
[10] Lee K H, Palangi H, Chen X, et al. (2019). Learning visual relation priors for image-text matching and image captioning with neural scene graph generators. arXiv preprint arXiv:1909.09953.
[11] Li Y, Ouyang W, Zhou B, et al. (2017). Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1261-1270.
[12] Cha M, Gwon Y L, Kung H T (2019). Adversarial learning of semantic relevance in text to image synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, 33(01): 3272-3279.
[13] Schuster S, Krishna R, Chang A, et al. (2015). Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the Fourth Workshop on Vision and Language, 70-80.
[14] Goodfellow I, Pouget-Abadie J, Mirza M, et al. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
[15] Mirza M, Osindero S (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[16] Reed S E, Akata Z, Mohan S, et al. (2016). Learning what and where to draw. Advances in Neural Information Processing Systems, 29: 217-225.
[17] Zhang H, Xu T, Li H, et al. (2017). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 5907-5915.
[18] Zhang H, Xu T, Li H, et al. (2018). StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8): 1947-1962.
[19] Xu T, Zhang P, Huang Q, et al. (2018). AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1316-1324.
[20] Zhang Z, Xie Y, Yang L (2018). Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6199-6208.
[21] Odena A, Olah C, Shlens J (2017). Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (ICML), PMLR, 2642-2651.
[22] Anderson P, Fernando B, Johnson M, et al. (2016). SPICE: Semantic propositional image caption evaluation. In European Conference on Computer Vision (ECCV), Springer, 382-398.
[23] Park T, Liu M Y, Wang T C, et al. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2337-2346.
[24] Heusel M, Ramsauer H, Unterthiner T, et al. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30.
[25] Isola P, Zhu J Y, Zhou T, et al. (2017). Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1125-1134.
[26] Johnson J, Alahi A, Fei-Fei L (2016). Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (ECCV), Springer, 694-711.
[27] Lin T Y, Maire M, Belongie S, et al. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), Springer, 740-755.
[28] Krishna R, Zhu Y, Groth O, et al. (2017). Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1): 32-73.
[29] Salimans T, Goodfellow I, Zaremba W, et al. (2016). Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29: 2234-2242.
[30] Simonyan K, Zisserman A (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[31] Zhang R, Isola P, Efros A A, et al. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 586-595.

        Published In

        CSAE '21: Proceedings of the 5th International Conference on Computer Science and Application Engineering
        October 2021
        660 pages
        ISBN:9781450389853
        DOI:10.1145/3487075

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. Image-text retrieval
        2. Scene graph
        3. Text-to-image generation

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Funding Sources

        • National Key Research and Development Plan of China

        Conference

        CSAE 2021

        Acceptance Rates

        Overall Acceptance Rate 368 of 770 submissions, 48%

