Information Sciences

Volume 607, August 2022, Pages 1265-1285

Text-to-image synthesis: Starting composite from the foreground content

https://doi.org/10.1016/j.ins.2022.06.044

Abstract

Recently, text-to-image synthesis has become a hot topic in computer vision and has attracted wide attention. Many methods have achieved encouraging results in this field, but further improving the quality of synthesized images remains a great challenge. In this paper, we propose a multi-stage synthesis method that starts the composite from the foreground content. The whole synthesis process is divided into three stages. The first stage generates the foreground result, and the third stage synthesizes the final image. The second stage covers two situations: one continues to synthesize the foreground result; the other synthesizes an image result that already contains background information. Experiments demonstrate that continuing to generate the foreground result in the second stage achieves better results on the Caltech-UCSD Birds (CUB) and Oxford-102 datasets, while synthesizing the foreground result only in the first stage obtains better performance on the Microsoft Common Objects in Context (MS COCO) dataset. Besides, our synthesized results on the three datasets are subjectively more realistic, with better handling of details. Our method also outperforms most existing methods in quantitative comparisons, which demonstrates its effectiveness and superiority.

Introduction

In computer vision, image synthesis has long been a widely studied research field. Owing to the rapid development of deep learning, many breakthroughs have been made in image synthesis, especially since the introduction of Generative Adversarial Networks (GAN) [19], which have produced many encouraging results in this field. Nevertheless, the traditional GAN only uses noise vectors drawn from a Gaussian or uniform distribution as input for image synthesis, so the categories a trained model can synthesize depend entirely on the training dataset. For example, when a bird image dataset is used for training, the corresponding model can synthesize bird images, and when a flower image dataset is used, the model can generate flower images. Therefore, using only noise vectors as input leaves the trained model without good flexibility or controllability.

In order to solve this problem, the conditional Generative Adversarial Network (CGAN) [26] was proposed. CGAN introduces conditional variables into the input to achieve reasonable control over the type of composite image. For example, when the image categories of birds or flowers are used during training, the trained model can generate the corresponding bird or flower images. CGAN thus achieves, to a certain extent, the flexibility and controllability required in image synthesis. However, CGAN can only determine the specific type of the synthesized image through the category label; it cannot determine the specific content of the composite image. For example, given the category label 'bird', the model can synthesize a bird image, but the specific color, size, and other attributes of the bird cannot be determined. To further improve the overall flexibility and controllability of image synthesis, text-to-image synthesis was proposed. Text-to-image synthesis produces the corresponding image result from an input text description. Because the text contains richer basic information, it can determine the specific content of the composite image. For example, given the text description "this grey bird has an impressive wingspan, a grey bill, and a white stripe that surrounds the feathers near the bill", the model can synthesize an image result that matches the semantic information of the input text. Hence, text-to-image synthesis offers better flexibility and controllability. Besides, it has tremendous potential applications, such as computer-aided design, art generation, image editing, and video games. For these reasons, the text-to-image synthesis research field has received extensive attention.
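As a minimal illustration of the conditioning idea described above (a sketch in Python/PyTorch; the class name, layer sizes, and dimensions are our own illustrative choices, not taken from any cited model), the condition is embedded and concatenated with the noise vector before being decoded into an image:

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Toy conditional generator: a class label is embedded and concatenated
    with the noise vector, so the condition decides *what* is drawn while the
    noise only decides *how* it is drawn."""
    def __init__(self, noise_dim=100, num_classes=200, embed_dim=50):
        super().__init__()
        self.label_embed = nn.Embedding(num_classes, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 128 * 8 * 8),
            nn.ReLU(inplace=True),
            nn.Unflatten(1, (128, 8, 8)),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, z, labels):
        cond = self.label_embed(labels)                # (B, embed_dim)
        return self.net(torch.cat([z, cond], dim=1))   # (B, 3, 32, 32)

# A text-to-image model replaces the label embedding with a sentence embedding,
# which is what lets it control fine-grained content such as color or size.
z = torch.randn(4, 100)
labels = torch.randint(0, 200, (4,))
fake = ConditionalGenerator()(z, labels)
```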

Many promising results have already been achieved in text-to-image synthesis research. Reed et al. [32] first proposed an end-to-end GAN structure and realized image synthesis from a text description. However, the overall clarity and authenticity of the synthesized images were poor in that work. To further improve the quality of the synthesized image, many improved methods were proposed later. Zhang et al. [14], [13] proposed a stacked generation method and achieved high-quality synthesis results. Xu et al. [36] introduced an attention mechanism to obtain high-resolution results. Subsequently, hierarchical nesting [49], mirror text comparison [39], and prior knowledge guidance [38] were proposed and achieved higher-quality image results. Although current text-to-image synthesis methods have achieved encouraging results, there is still room for improvement in the quality of synthetic images. Hence, the study of text-to-image synthesis remains challenging.

In order to narrow the gap between synthetic and real images, we propose a multi-stage synthesis method that starts the composite from the foreground content. First, the foreground result is synthesized based on the text description, and then the final image is synthesized by combining the foreground result with the text description. In the foreground synthesis stage, the whole architecture can concentrate on synthesizing the foreground objects so as to generate a refined foreground result. This refined foreground result in turn promotes the subsequent image synthesis, so a higher-quality image result can finally be achieved. Fig. 1 compares the synthesized results of our method with real images. The comparison shows that our results are basically equivalent to the real images in synthesis effect.
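The data flow just described can be summarized by the following sketch (module names and signatures are hypothetical placeholders; only the stage ordering follows the description above):

```python
import torch

def synthesize(text, text_encoder, stage1_g, stage2_g, stage3_g, noise):
    """Foreground-first pipeline: stage 1 draws a coarse foreground, stage 2
    refines it (or, in one variant, already adds background), and stage 3
    composes the final image conditioned on the same text feature."""
    sent_emb = text_encoder(text)      # sentence feature from the text encoder
    fg = stage1_g(noise, sent_emb)     # stage 1: foreground result
    mid = stage2_g(fg, sent_emb)       # stage 2: refined foreground / image with background
    final = stage3_g(mid, sent_emb)    # stage 3: final image result
    return fg, mid, final

# Stand-in modules just to show the call pattern; real generators are networks.
stub_g = lambda x, s: torch.zeros(1, 3, 64, 64)
stub_enc = lambda t: torch.zeros(1, 256)
outputs = synthesize("this grey bird has a grey bill", stub_enc,
                     stub_g, stub_g, stub_g, torch.randn(1, 100))
```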

Our contributions are as follows: (1) a new multi-stage text-to-image synthesis method that starts the composite from the foreground content is proposed to achieve higher-quality image generation; (2) a dynamic selection method is introduced into the generator structure to fine-tune the synthetic results in a fine-grained manner and achieve an excellent synthesis effect (a possible form of such a module is sketched below); (3) extensive experiments on the CUB [6], Oxford-102 flower [24], and MS COCO [40] datasets show the effectiveness, generality, and superiority of the proposed method; (4) ablation experiments demonstrate that synthesizing the foreground image in the first two stages achieves the best results on the CUB and Oxford-102 datasets, while generating the foreground result only in the first stage obtains the best performance on the MS COCO dataset; (5) the better synthesis performance achieved by our method can, to a certain extent, accelerate the development of text-to-image research towards application fields. Besides, we analyze the remaining problems of the proposed method and discuss some feasible solutions that have good reference value.
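The dynamic selection module itself is only named in this snippet; purely as an assumption about what such a block could look like (in the spirit of selective-kernel gating, not necessarily this paper's design), the network predicts per-sample weights that decide how much each parallel branch contributes:

```python
import torch
import torch.nn as nn

class DynamicSelect(nn.Module):
    """Illustrative dynamic-selection block (an assumption, not the paper's
    exact module): two parallel convolution branches are fused with softmax
    weights predicted from the input, letting the generator adaptively pick,
    per sample, how much each branch refines the current feature map."""
    def __init__(self, channels):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 2, 1),   # one selection logit per branch
        )

    def forward(self, x):
        b3, b5 = self.branch3(x), self.branch5(x)
        w = torch.softmax(self.gate(x), dim=1)   # (B, 2, 1, 1), sums to 1 per sample
        return w[:, :1] * b3 + w[:, 1:] * b5

out = DynamicSelect(64)(torch.randn(2, 64, 16, 16))   # same shape as the input
```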

The rest of the paper is arranged as follows. Section 2 briefly reviews the related works on image synthesis and text-to-image synthesis. The fundamental techniques of text-to-image synthesis are introduced in Section 3. Our method details are discussed in Section 4 and validated in Section 5 with promising experimental results. Section 6 concludes our work.

Section snippets

Related work

Image Synthesis. The effective construction of image generation models is a fundamental problem in computer vision. For image synthesis research, the core task is to establish a useful image generation model that can synthesize more realistic image results. Remarkable progress has been made in the image synthesis field with the emergence of deep learning techniques. Variational Autoencoders (VAE) [10] utilized a probabilistic graphical model to achieve better generation by maximizing the lower bound of

Generative Adversarial Networks

Generative Adversarial Networks consist of a generator G and a discriminator D, whose performance can be improved simultaneously through adversarial learning. The goal of G is to synthesize a data distribution similar to that of the original data so that it can deceive D, while the goal of D is not to be deceived by G. The specific process is a min-max game, whose objective is

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$,

where x is a real sample drawn from the data distribution p_data and z is a noise vector drawn from the prior p_z.
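In practice, this min-max game is optimized by alternating gradient steps on D and G; a minimal, generic sketch of one such alternation (not code from this paper) is:

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, real, opt_g, opt_d, noise_dim=100):
    """One alternating update of the min-max objective above.

    D is pushed towards log D(x) + log(1 - D(G(z))); G is updated with the
    usual non-saturating surrogate (maximize log D(G(z))) rather than
    minimizing log(1 - D(G(z))) directly, which trains more stably."""
    z = torch.randn(real.size(0), noise_dim, device=real.device)

    # --- discriminator step ---
    fake = G(z).detach()                      # do not backpropagate into G here
    d_real, d_fake = D(real), D(fake)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- generator step ---
    d_fake = D(G(z))
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```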

Specific generation structure

In our specific method, the synthesis process is divided into three stages. The generation structure of the first stage is shown in Fig. 2. For the input text description, a text encoder [36] is used to encode the sentence feature, and the sentence feature is then enhanced by the conditional augmentation technique [14]. On the one hand, conditional augmentation can expand the number of training samples; on the other hand, it can alleviate the over-fitting problem. The specific equation of conditional
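As a sketch of the conditional augmentation technique from StackGAN [14] referred to above (the dimension choices are ours, not this paper's): the sentence embedding parameterizes a Gaussian from which the conditioning code is resampled, with a KL penalty keeping that Gaussian close to the standard normal.

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Conditional augmentation in the style of StackGAN [14]:
    c_hat = mu(e) + sigma(e) * eps with eps ~ N(0, I), plus a
    KL(N(mu, sigma^2) || N(0, I)) regularizer added to the generator loss.
    Resampling c_hat effectively multiplies the text-image training pairs
    and smooths the conditioning manifold, which helps against over-fitting."""
    def __init__(self, sent_dim=256, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(sent_dim, cond_dim * 2)   # predicts mu and log-variance

    def forward(self, sent_emb):
        mu, logvar = self.fc(sent_emb).chunk(2, dim=1)
        c_hat = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        kl = 0.5 * torch.mean(torch.sum(mu.pow(2) + logvar.exp() - logvar - 1, dim=1))
        return c_hat, kl

c_hat, kl_loss = ConditioningAugmentation()(torch.randn(4, 256))
```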

Experiments

Our experiments are carried out on a Linux 18.04 computer with one NVIDIA 2080 Ti GPU and 32 GB of memory.

We verify the performance of our method on the CUB [6], Oxford-102 [24], and MS COCO [40] datasets. The CUB dataset includes 11,788 images from 200 classes, where 8,855 images from 150 classes are used for training and the remaining 2,933 images from 50 classes are used for testing. The Oxford-102 dataset contains 8,189 images from 102 categories, 7,034 images from 82 categories of which

Conclusion

In this paper, a method that starts the composite from the foreground content is proposed to achieve higher-quality synthesized results in the text-to-image synthesis field. Unlike existing text-to-image synthesis work, our method first synthesizes the foreground result based on the text description and then combines the foreground result with the text description to synthesize the final image result. In addition, the dynamic selection method is also introduced into the network to fine-tune

CRediT authorship contribution statement

Zhiqiang Zhang: Conceptualization, Methodology, Writing – original draft, Investigation. Jinjia Zhou: Project administration, Supervision. Wenxin Yu: Writing – review & editing, Formal analysis. Ning Jiang: Data curation, Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This research is supported by the Joint Research Project of Young Researchers of Hosei University in 2021 and by the Sichuan Science and Technology Program (No. 2022YFG0324).

Zhiqiang Zhang received the B.S. and M.S. degrees from Southwest University of Science and Technology, Mianyang, China, in 2017 and 2020, respectively. He is currently pursuing the Ph.D. degree at Hosei University, Tokyo, Japan. His research interests include image synthesis, multi-modal information transformation and fusion, game theory, computer vision, and deep learning.

References (49)

  • R. Alec et al., Unsupervised representation learning with deep convolutional generative adversarial networks.
  • N. Anh et al., Plug & play generative networks: Conditional iterative generation of images in latent space.
  • T.G. Aurele et al., (αβ)-GAN: Robust generative adversarial networks, Inf. Sci. (2022).
  • X. Bing et al., Empirical evaluation of rectified activations in convolutional network, CoRR (2015).
  • L. Bowen et al., Controllable text-to-image generation, Proc. NeurIPS Conf. (2019).
  • W. Catherine, B. Steve, W. Peter, P. Pietro, B. Serge, The Caltech-UCSD Birds-200-2011 dataset, Tech. Rep. ...
  • G. Chengying et al., SketchyCOCO: Image generation from freehand scene sketches, Proc. CVPR Conf. (2020).
  • S. Christian et al., Rethinking the inception architecture for computer vision, Proc. CVPR Conf. (2016).
  • P.K. Diederik et al., Adam: A method for stochastic optimization.
  • P.K. Diederik et al., Auto-encoding variational Bayes.
  • P. Dunlu et al., SAM-GAN: Self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis, Neural Networks (2021).
  • M. Fengling et al., Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation, Sci. China Inf. Sci. (2021).
  • Z. Han et al., StackGAN++: Realistic image synthesis with stacked generative adversarial networks, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
  • Z. Han et al., StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks, Proc. ICML Conf. (2017).
  • D. Hao et al., Semantic image synthesis via adversarial learning, Proc. ICCV Conf. (2017).
  • D. Hong et al., Deep attentive style transfer for images with wavelet decomposition, Inf. Sci. (2022).
  • T. Hongchen et al., Semantics-enhanced adversarial nets for text-to-image synthesis, Proc. ICCV Conf. (2019).
  • H. Hua et al., Image style transfer for autonomous multi-robot systems, Inf. Sci. (2021).
  • J.G. Ian et al., Generative adversarial nets, Proc. NIPS Conf. (2014).
  • K.J. Joseph et al., C4Synth: Cross-caption cycle-consistent text-to-image synthesis, Proc. WACV Conf. (2019).
  • H. Kaiming et al., Deep residual learning for image recognition, Proc. CVPR Conf. (2016).
  • G. Lianli et al., Perceptual pyramid adversarial networks for text-to-image synthesis, Proc. AAAI Conf. (2019).
  • G. Lianli, C. Daiyuan, Z. Zhou, S. Jie, S. HengTao, Lightweight dynamic conditional GAN with pyramid attention for ...
  • N. Maria-Elena et al., Automated flower classification over a large number of classes, Proc. ICVGIP Conf. (2008).

Jinjia Zhou received the B.E. degree from Shanghai Jiao Tong University, China, in 2007, and the M.E. and Ph.D. degrees from Waseda University, Japan, in 2010 and 2013, respectively. From 2013 to 2016, she was a junior researcher with Waseda University, Fukuoka, Japan. She is currently an Associate Professor and a co-director of the English-based graduate program at Hosei University. She is also a senior visiting scholar in the State Key Laboratory of ASIC & System, Fudan University, China. Her interests are in algorithms and VLSI architectures for multimedia signal processing. Dr. Zhou was selected as a JST PRESTO researcher for 2017-2021. She received a research fellowship of the Japan Society for the Promotion of Science during 2010-2013. Dr. Zhou is a recipient of the Chinese Government Award for Outstanding Students Abroad of 2012. She received the Hibikino Best Thesis Award in 2011. She was a co-recipient of the ISSCC 2016 Takuo Sugano Award for Outstanding Far-East Paper, the best student paper award of the VLSI Circuits Symposium 2010, and the design contest award of ACM ISLPED 2010. She participated in the design of the world's first 8K UHDTV video decoder chip, which was granted the 2012 Semiconductor of the Year Award of Japan.

Wenxin Yu was born in Mianyang, Sichuan, China, in 1984. He received the B.S. degree from Shanghai Jiao Tong University, Shanghai, China, in 2006, and the M.S. and Ph.D. degrees from Waseda University, Tokyo, Japan, in 2010 and 2013, respectively. He was an associate research fellow from 2015 to 2021 and has been a professor since 2022 with the School of Computer Science and Technology, Southwest University of Science and Technology. He has been exploring cutting-edge directions in video and image processing, focusing on 3-D image synthesis, image stereo matching, and related issues. His main research fields include 3D multi-view synthesis filling technology, image stereo matching technology, multi-view compatible fast coding algorithms, neural networks, pattern recognition, low-power video decoding algorithms, and image error concealment technology.

Ning Jiang received the B.S. degree from Shanghai Jiao Tong University, Shanghai, China, in 2006, and the M.S. and Ph.D. degrees from Waseda University, Tokyo, Japan, in 2010 and 2013, respectively. From 2016 to 2018, he was a lecturer in the School of Computer Science and Technology, Nantong University. Since 2018, he has been an associate professor with the School of Computer Science and Technology, Southwest University of Science and Technology. His research interests include computer vision and pattern recognition, deep learning, and artificial neural networks.
