Abstract
This paper presents an improved adversarial network for generating visual content from textual descriptions. Synthesizing high-quality images from text is among the most challenging problems in computer vision. Existing methods first generate an initial image sketch and then refine it to add fine-grained detail to different regions of the image. Most available text-to-image generation approaches roughly reflect the meaning of a given text description but fail to render details and distinct object parts, for two reasons: (1) they depend on the initially generated image, so if the initial image is poor, the refinement stage cannot produce a detailed, fine-grained result; and (2) the importance of each word varies with the image's content, yet the same text representation is used regardless of that content. Here, an improved adversarial network based on hyper-parameter optimization is proposed to generate fine-grained images. Inception Score (IS), t-Distributed Stochastic Neighbor Embedding (t-SNE), and R-precision are used as metrics to evaluate and automatically refine the initial image. An attention mechanism attends to the most informative words of the text description to generate more refined sub-parts of the image, and an attentional module computes an image-text matching loss for generator training. The proposed model is evaluated on the Caltech-UCSD Birds 200 dataset. Results in terms of Inception Score, R-precision, and t-SNE show that the model performs favourably against the state-of-the-art approaches AttnGAN (51) and DM-GAN (61), improving on them by 25.72% and 19.37%, respectively, in Inception Score.
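The Inception Score used above rewards images that are individually classified with confidence while collectively covering many classes: IS = exp(E_x[KL(p(y|x) || p(y))]). As a minimal sketch (assuming class probabilities have already been obtained from a pretrained Inception-v3 classifier, which is not shown here), it can be computed as:

```python
import numpy as np

def inception_score(preds, eps=1e-12):
    """Inception Score from an (N, C) matrix of class probabilities,
    one softmax row per generated image: exp(E_x[KL(p(y|x) || p(y))])."""
    preds = np.asarray(preds, dtype=np.float64)
    marginal = preds.mean(axis=0)  # p(y): marginal class distribution
    # Per-image KL divergence between conditional and marginal distributions
    kl = preds * (np.log(preds + eps) - np.log(marginal + eps))
    return float(np.exp(kl.sum(axis=1).mean()))

# A generator whose samples are all classified uniformly (no confidence,
# no diversity beyond the marginal) gets the minimum score of 1.0;
# confident, diverse predictions push the score toward the class count.
print(round(inception_score(np.full((4, 10), 0.1)), 3))  # -> 1.0
```

In practice the score is usually averaged over several splits of the generated set, and its value depends on the classifier used, so scores are only comparable when computed with the same Inception network.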
Data Availability
Work is still in progress; the data is not publicly available.
References
Abbood SH, Abdull Hamed HN, Mohd Rahim MS, Alaidi AHM, Salim ALRikabi HT (2022) Dr-ll gan: Diabetic retinopathy lesions synthesis using generative adversarial network. International Journal of Online & Biomedical Engineering 18(3)
Aggarwal A, Alshehri M, Kumar M, Sharma P, Alfarraj O, Deep V (2021) Principal component analysis, hidden markov model, and artificial neural network inspired techniques to recognize faces. Concurr Comput: Pract Exper 33(9):6157
Aggarwal A, Kumar M (2021) Image surface texture analysis and classification using deep learning. Multimed Tools Appl 80(1):1289–1309
Agnese J, Herrera J, Tao H, Zhu X (2020) A survey and taxonomy of adversarial neural networks for text-to-image synthesis. Wiley Interdiscip Rev: Data Min Knowl Discov 10(4):1345
Banerjee S, Das S (2020) Sd-gan: Structural and denoising gan reveals facial parts under occlusion. arXiv:2002.08448
Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P (2016) Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems 29
Cheng J, Wu F, Tian Y, Wang L, Tao D (2020) Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 10911–10920
Dash A, Gamboa JCB, Ahmed S, Liwicki M, Afzal MZ (2017) Tac-gan-text conditioned auxiliary classifier generative adversarial network. arXiv:1703.06412
Ding M, Yang Z, Hong W, Zheng W, Zhou C, Yin D, Lin J, Zou X, Shao Z, Yang H et al (2021) Cogview: Mastering text-to-image generation via transformers. arXiv:2105.13290
Dolhansky B, Ferrer CC (2018) Eye in-painting with exemplar generative adversarial networks. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 7902–7911
Dong H, Yu S, Wu C, Guo Y (2017) Semantic image synthesis via adversarial learning. In: Proceedings of the IEEE International conference on computer vision, pp 5706–5714
Fu A, Hou Y (2017) Text-to-image generation using multi-instance stackgan
Gao L, Chen D, Zhao Z, Shao J, Shen HT (2021) Lightweight dynamic conditional gan with pyramid attention for text-to-image synthesis. Pattern Recogn 110:107384
Garg K, Singh V, Tiwary US (2021) Textual description generation for visual content using neural networks. In: International Conference on intelligent human computer interaction, pp 16–26. Springer
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Advances in neural information processing systems 27
Gou Y, Wu Q, Li M, Gong B, Han M (2020) Segattngan: Text to image generation with segmentation attention. arXiv:2005.12444
Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30
Hinz T, Heinrich S, Wermter S (2019) Semantic object accuracy for generative text-to-image synthesis. arXiv:1910.13321
Hong S, Yang D, Choi J, Lee H (2018) Inferring semantic layout for hierarchical text-to-image synthesis. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 7986–7994
Huang H, Yu PS, Wang C (2018) An introduction to image synthesis with generative adversarial nets. arXiv:1803.04469
Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539
Kiros R, Salakhutdinov R, Zemel R (2014) Multimodal neural language models. In: International Conference on Machine Learning, pp 595–603. PMLR
Kumar M, Aggarwal J, Rani A, Stephan T, Shankar A, Mirjalili S (2021) Secure video communication using firefly optimization and visual cryptography. Artificial Intelligence Review, pp 1–21
Lee S, Tariq S, Shin Y, Woo SS (2021) Detecting handcrafted facial image manipulations and gan-generated facial images using shallow-fakefacenet. Appl Soft Comput 105:107256
Li B, Qi X, Lukasiewicz T, Torr P (2019) Controllable text-to-image generation. Advances in Neural Information Processing Systems 32
Li W, Zhang P, Zhang L, Huang Q, He X, Lyu S, Gao J (2019) Object-driven text-to-image synthesis via adversarial training. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 12174–12182
Mansimov E, Parisotto E, Ba JL, Salakhutdinov R (2015) Generating images from captions with attention. arXiv:1511.02793
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv:1411.1784
Mishra P, Rathore TS, Shivani S, Tendulkar S (2020) Text to image synthesis using residual gan. In: 2020 3rd International conference on emerging technologies in computer engineering: Machine learning and internet of things (ICETCE), pp. 139–144. IEEE
Nguyen A, Clune J, Bengio Y, Dosovitskiy A, Yosinski J (2017) Plug & play generative networks: Conditional iterative generation of images in latent space. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 4467–4477
Odena A, Olah C, Shlens J (2017) Conditional image synthesis with auxiliary classifier gans. In: International conference on machine learning, pp 2642–2651. PMLR
Pan Z, Yu W, Yi X, Khan A, Yuan F, Zheng Y (2019) Recent progress on generative adversarial networks (gans): a survey. IEEE Access 7:36322–36333
Peng D, Yang W, Liu C, Lü S (2021) Sam-gan: Self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Netw 138:57–67
Qiao T, Zhang J, Xu D, Tao D (2019) Mirrorgan: Learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 1505–1514
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. arXiv:2102.12092
Reed S, Akata Z, Lee H, Schiele B (2016) Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 49–58
Reed SE, Akata Z, Mohan S, Tenka S, Schiele B, Lee H (2016) Learning what and where to draw. Adv Neural Inf Process Syst 29:217–225
Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H (2016) Generative adversarial text to image synthesis. In: International conference on machine learning, pp 1060–1069. PMLR
Sah S, Peri D, Shringi A, Zhang C, Dominguez M, Savakis A, Ptucha R (2018) Semantically invariant text-to-image generation. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp 3783–3787. IEEE
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. Adv Neural Inf Process Syst 29:2234–2242
Sun Q, Chang K-H, Dormer KJ, Dyer Jr RK, Gan RZ (2002) An advanced computer-aided geometric modeling and fabrication method for human middle ear. Med Eng Phys 24(9):595–606
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 2818–2826
Tao M, Tang H, Wu S, Sebe N, Jing X-Y, Wu F, Bao B (2020) Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv:2008.05865
Valle R (2019) Hands-on generative adversarial networks with keras: Your guide to implementing next-generation generative adversarial networks. Packt Publishing Ltd
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 3156–3164
Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The caltech-ucsd birds-200-2011 dataset
Xia W, Yang Y, Xue J-H, Wu B (2021) Tedigan: Text-guided diverse face image generation and manipulation. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 2256–2265
Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057. PMLR
Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X (2018) Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 1316–1324
Ye H, Yang X, Takac M, Sunderraman R, Ji S (2021) Improving text-to-image synthesis using contrastive learning. arXiv:2107.02423
Yuan M, Peng Y (2018) Text-to-image synthesis via symmetrical distillation networks, pp 1407–1415
Zakraoui J, Saleh M, Al-Maadeed S, Jaam JM (2021) Improving text-to-image generation with object layout guidance. Multimedia Tools and Applications, pp 1–21
Zhang Y, Han S, Zhang Z, Wang J, Bi H (2022) Cf-gan: cross-domain feature fusion generative adversarial network for text-to-image synthesis. The Visual Computer, pp 1–11
Zhang H, Koh JY, Baldridge J, Lee H, Yang Y (2021) Cross-modal contrastive learning for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 833–842
Zhang C, Peng Y (2018) Stacking vae and gan for context-aware text-to-image generation. In: 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), pp 1–5. IEEE
Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2017) Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International conference on computer vision, pp 5907–5915
Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2018) Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 41(8):1947–1962
Zhou P, Yu N, Wu Z, Davis LS, Shrivastava A, Lim S-N (2021) Deep video inpainting detection. arXiv:2101.11080
Zhu M, Pan P, Chen W, Yang Y (2019) Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 5802–5810
Funding
The authors did not receive support from any organization for the submitted work.
Author information
Contributions
Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Writing - original draft preparation, Writing - review and editing: Varsha Singh; Validation: Varsha Singh and Uma Shanker Tiwary; Funding acquisition, Project administration, Resources and Supervision: Uma Shanker Tiwary.
Ethics declarations
Ethics approval
We confirm that the manuscript has been read and approved by both named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by both of us.
Conflict of interest/Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Singh, V., Tiwary, U.S. Visual content generation from textual description using improved adversarial network. Multimed Tools Appl 82, 10943–10960 (2023). https://doi.org/10.1007/s11042-022-13720-3