DOI: 10.1145/3579654.3579717
Research article

A Semi-Parametric Method for Text-to-Image Synthesis from Prior Knowledge

Published: 14 March 2023

ABSTRACT

Text-to-image synthesis takes only a text description as input and generates an image that has high visual quality and is semantically aligned with that text. Compared with images, textual semantics are ambiguous and sparse, which makes it challenging to map features directly and accurately from the text space to the image space. An intuitive way to address this issue is to construct an intermediate space that connects text and image. Using layout as a bridge between text and image not only mitigates the difficulty of the task but also constrains the spatial distribution of objects in the generated images, which is crucial to the quality of the synthesized images. In this paper, we build a two-stage framework for text-to-image synthesis: (1) layout searching by text matching and (2) layout-to-image synthesis with fine-grained textual semantic injection. Specifically, we build prior layout knowledge from the training dataset and propose a semi-parametric layout-searching strategy that retrieves the layout matching the input sentence by measuring the semantic distance between textual descriptions. In the layout-to-image stage, we construct Textual and Spatial Alignment Generative Adversarial Networks (TSAGANs), designed to guarantee fine-grained alignment of the generated images with the input text and the layout obtained in the first stage. Extensive experiments on the COCO-Stuff dataset show that our method obtains more reasonable layouts and significantly improves the quality of the synthesized images.
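The layout-searching stage described above can be illustrated with a minimal sketch. This is not the authors' implementation: the sentence encoder below is a toy bag-of-words hashing stand-in, and the layout bank is a hypothetical list of (class label, bounding box) pairs taken from training annotations. The sketch only shows the semi-parametric idea of retrieving the layout whose training caption is semantically closest to the input sentence.

```python
# Sketch of semi-parametric layout retrieval: given an input caption, return the
# training layout whose caption embedding is closest (cosine similarity).
# The encoder here is a toy stand-in; the paper's actual text encoder may differ.
import numpy as np

def embed(sentence: str, dim: int = 64) -> np.ndarray:
    """Toy bag-of-words hashing encoder, used only to make the sketch runnable."""
    vec = np.zeros(dim)
    for token in sentence.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def build_layout_bank(captions, layouts):
    """Precompute caption embeddings for the training set (the non-parametric memory)."""
    return np.stack([embed(c) for c in captions]), list(layouts)

def retrieve_layout(query: str, caption_embs: np.ndarray, layouts):
    """Return the layout paired with the semantically closest training caption."""
    sims = caption_embs @ embed(query)  # cosine similarity, embeddings are unit-norm
    return layouts[int(np.argmax(sims))]

# Hypothetical usage: layouts are (class label, bounding box) pairs, as in COCO-Stuff.
captions = ["a man riding a horse on a beach", "two dogs playing in the snow"]
layouts = [[("person", (30, 20, 80, 160)), ("horse", (60, 90, 180, 140))],
           [("dog", (10, 60, 90, 80)), ("dog", (120, 70, 90, 85))]]
embs, bank = build_layout_bank(captions, layouts)
print(retrieve_layout("a person rides a horse along the shore", embs, bank))
```

The retrieved layout would then condition the second-stage generator (TSAGANs in the paper), which injects the fine-grained textual semantics during image synthesis.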


Published in

ACAI '22: Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence
December 2022, 770 pages
ISBN: 9781450398336
DOI: 10.1145/3579654
Copyright © 2022 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 173 of 395 submissions, 44%
