ABSTRACT
Text-to-image synthesis takes only a text description as input and generates images that should be of high visual quality and semantically aligned with that text. Compared to images, textual semantics are ambiguous and sparse, which makes it challenging to map features directly and accurately from the text space to the image space. An intuitive way to address this issue is to construct an intermediate space connecting text and image. Using layout as a bridge between text and image not only reduces the difficulty of the task but also constrains the spatial distribution of objects in the generated images, which is crucial to the quality of the synthesized results. In this paper, we build a two-stage framework for text-to-image synthesis, i.e., Layout Searching by Text Matching, and Layout-to-Image Synthesis with Fine-Grained Textual Semantic Injection. Specifically, we build prior layout knowledge from the training dataset and propose a semi-parametric layout searching strategy that retrieves the layout matching the input sentence by measuring the semantic distance between textual descriptions. In the layout-to-image stage, we construct Textual and Spatial Alignment Generative Adversarial Networks (TSAGANs), designed to guarantee fine-grained alignment of the generated images with both the input text and the layout obtained in the first stage. Extensive experiments on the COCO-Stuff dataset show that our method obtains more reasonable layouts and significantly improves the quality of the synthesized images.
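The first stage described above is a semi-parametric retrieval step: caption–layout pairs are collected from the training set, and the layout whose caption is semantically closest to the input sentence is reused as the spatial prior. The abstract does not specify the text-matching model, so the following is a minimal sketch assuming cosine similarity over mean word embeddings; `encode`, `retrieve_layout`, and the bank structure are illustrative names, not the authors' API.

```python
import numpy as np

def encode(sentence, embed, dim=300):
    """Mean of word vectors as a stand-in sentence encoder (assumption)."""
    vecs = [embed[w] for w in sentence.lower().split() if w in embed]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def retrieve_layout(query, bank, embed):
    """Return the layout from the bank whose paired caption best matches `query`.

    `bank` is a list of (caption, layout) pairs built from the training set,
    where a layout is a list of (class_label, bounding_box) tuples.
    """
    q = encode(query, embed)
    q = q / (np.linalg.norm(q) + 1e-8)   # L2-normalize the query embedding
    best_layout, best_sim = None, -np.inf
    for caption, layout in bank:
        c = encode(caption, embed)
        c = c / (np.linalg.norm(c) + 1e-8)
        sim = float(q @ c)               # cosine similarity as semantic distance
        if sim > best_sim:
            best_layout, best_sim = layout, sim
    return best_layout, best_sim
```

The retrieved layout (object classes plus bounding boxes) then conditions the second-stage generator, which injects fine-grained textual semantics so that the synthesized image aligns with both the sentence and the spatial arrangement.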