ABSTRACT
Text-to-image synthesis takes only a text description as input and generates images that should be of high visual quality and semantically aligned with that text. Compared to images, textual semantics are ambiguous and sparse, which makes it challenging to map features directly and accurately from the text space to the image space. An intuitive way to address this issue is to construct an intermediate space connecting text and image. Using layout as a bridge between text and image not only reduces the difficulty of the task but also constrains the spatial distribution of objects in the generated images, which is crucial to the quality of the synthesized results. In this paper, we build a two-stage framework for text-to-image synthesis, i.e., Layout Searching by Text Matching, and Layout-to-Image Synthesis with Fine-Grained Textual Semantic Injection. Specifically, we build prior layout knowledge from the training dataset and propose a semi-parametric layout searching strategy that retrieves the layout matching the input sentence by measuring the semantic distance between textual descriptions. In the layout-to-image stage, we construct Textual and Spatial Alignment Generative Adversarial Networks (TSAGANs), designed to guarantee fine-grained alignment of the generated images with both the input text and the layout obtained in the first stage. Extensive experiments on the COCO-Stuff dataset show that our method obtains more reasonable layouts and significantly improves the quality of the synthesized images.
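The first stage described above is a semi-parametric retrieval step: caption–layout pairs are collected from the training set, and the layout whose caption is semantically closest to the input sentence is reused as the spatial prior. The abstract does not specify the text-matching model, so the following is a minimal sketch assuming cosine similarity over mean word embeddings; `encode`, `retrieve_layout`, and the bank structure are illustrative names, not the authors' API.

```python
import numpy as np

def encode(sentence, embed, dim=300):
    """Mean of word vectors as a stand-in sentence encoder (assumption)."""
    vecs = [embed[w] for w in sentence.lower().split() if w in embed]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def retrieve_layout(query, bank, embed):
    """Return the layout from the bank whose paired caption best matches `query`.

    `bank` is a list of (caption, layout) pairs built from the training set,
    where a layout is a list of (class_label, bounding_box) tuples.
    """
    q = encode(query, embed)
    q = q / (np.linalg.norm(q) + 1e-8)   # L2-normalize the query embedding
    best_layout, best_sim = None, -np.inf
    for caption, layout in bank:
        c = encode(caption, embed)
        c = c / (np.linalg.norm(c) + 1e-8)
        sim = float(q @ c)               # cosine similarity as semantic distance
        if sim > best_sim:
            best_layout, best_sim = layout, sim
    return best_layout, best_sim
```

The retrieved layout (object classes plus bounding boxes) then conditions the second-stage generator, which injects fine-grained textual semantics so that the synthesized image aligns with both the sentence and the spatial arrangement.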