Visual content generation from textual description using improved adversarial network

Published in: Multimedia Tools and Applications

Abstract

This paper presents an improved adversarial network for visual content generation from textual description. Synthesizing high-quality images from text descriptions is among the most challenging problems in computer vision. Existing methods first generate an initial image sketch and then refine it with fine-grained details in different portions of the image. Most available text-to-image generation methods roughly reflect the meaning of a given text description but fail to generate the details of the different parts of the objects, for two reasons. (1) They depend on the initially generated image: if the initial image is not generated correctly, the process fails to produce a fine-grained, detailed image. (2) Each word has a different level of importance depending on the image's content; however, the same text representation is used even for different image contents. Here, an improved adversarial network based on hyper-parameter optimization is proposed to generate fine-grained images. Inception Score (IS), t-Distributed Stochastic Neighbor Embedding (t-SNE), and R-precision are used as metrics to evaluate and refine the initial image automatically. An attention mechanism attends to the most relevant words of the text description to generate more refined sub-parts of the image, and an attentional module computes the image-text matching loss for generator training. The proposed model has been evaluated on the Caltech-UCSD Birds 200 dataset. Results using Inception Score, R-precision, and t-SNE show that the model performs favourably against the state-of-the-art approaches AttnGAN [51] and DM-GAN [61], improving the Inception Score by 25.72% and 19.37%, respectively.
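The two sketches below are illustrative only, not the authors' code; the tensor shapes, function names, and NumPy/PyTorch framing are assumptions made for exposition. The first shows word-to-region attention of the kind used in AttnGAN-style generators [51]: each image sub-region attends over the word embeddings of the description, yielding a word-context vector per region that guides refinement of that sub-part of the image.

```python
import torch
import torch.nn.functional as F

def word_region_attention(region_feats: torch.Tensor, word_embs: torch.Tensor):
    """region_feats: (B, D, N) image sub-region features (N = H*W),
    word_embs: (B, D, T) word embeddings from a text encoder.
    Shapes and names are assumptions, not the authors' implementation."""
    scores = torch.bmm(word_embs.transpose(1, 2), region_feats)  # (B, T, N)
    attn = F.softmax(scores, dim=1)       # distribution over words, per region
    context = torch.bmm(word_embs, attn)  # (B, D, N) word context per region
    return context, attn
```

The second sketch computes the Inception Score [42], defined as IS = exp(E_x[KL(p(y|x) || p(y))]), from class probabilities assumed to be precomputed by a pretrained Inception-v3 [44] on the generated images.

```python
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (num_images, num_classes) softmax outputs of a pretrained
    Inception-v3 on generated images (assumed precomputed)."""
    marginal = probs.mean(axis=0)  # p(y), the marginal label distribution
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))  # exp of the mean per-image KL divergence
```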


Data Availability

Work is still in progress; the data are not publicly available.

References

  1. Abbood SH, Abdull Hamed HN, Mohd Rahim MS, Alaidi AHM, Salim ALRikabi HT (2022) Dr-ll gan: Diabetic retinopathy lesions synthesis using generative adversarial network. International Journal of Online & Biomedical Engineering 18(3)

  2. Aggarwal A, Alshehri M, Kumar M, Sharma P, Alfarraj O, Deep V (2021) Principal component analysis, hidden markov model, and artificial neural network inspired techniques to recognize faces. Concurr Comput: Pract Exper 33(9):6157

  3. Aggarwal A, Kumar M (2021) Image surface texture analysis and classification using deep learning. Multimed Tools Appl 80(1):1289–1309

  4. Agnese J, Herrera J, Tao H, Zhu X (2020) A survey and taxonomy of adversarial neural networks for text-to-image synthesis. Wiley Interdiscip Rev: Data Min Knowl Discov 10(4):1345

  5. Agnese J, Herrera J, Tao H, Zhu X (2020) A survey and taxonomy of adversarial neural networks for text-to-image synthesis. Wiley Interdiscip Rev: Data Min Knowl Discov 10(4):1345

  6. Banerjee S, Das S (2020) Sd-gan: Structural and denoising gan reveals facial parts under occlusion. arXiv:2002.08448

  7. Chen X, Duan Y, Houthooft R, Schulman J, Sutskever I, Abbeel P (2016) Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems 29

  8. Cheng J, Wu F, Tian Y, Wang L, Tao D (2020) Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 10911–10920

  9. Dash A, Gamboa JCB, Ahmed S, Liwicki M, Afzal MZ (2017) Tac-gan-text conditioned auxiliary classifier generative adversarial network. arXiv:1703.06412

  10. Ding M, Yang Z, Hong W, Zheng W, Zhou C, Yin D, Lin J, Zou X, Shao Z, Yang H et al (2021) Cogview: Mastering text-to-image generation via transformers. arXiv:2105.13290

  11. Dolhansky B, Ferrer CC (2018) Eye in-painting with exemplar generative adversarial networks. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 7902–7911

  12. Dong H, Yu S, Wu C, Guo Y (2017) Semantic image synthesis via adversarial learning. In: Proceedings of the IEEE International conference on computer vision, pp 5706–5714

  13. Fu A, Hou Y (2017) Text-to-image generation using multi-instance stackgan

  14. Gao L, Chen D, Zhao Z, Shao J, Shen HT (2021) Lightweight dynamic conditional gan with pyramid attention for text-to-image synthesis. Pattern Recogn 110:107384

  15. Garg K, Singh V, Tiwary US (2021) Textual description generation for visual content using neural networks. In: International Conference on intelligent human computer interaction, pp 16–26. Springer

  16. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. Advances in neural information processing systems 27

  17. Gou Y, Wu Q, Li M, Gong B, Han M (2020) Segattngan: Text to image generation with segmentation attention. arXiv:2005.12444

  18. Heusel M, Ramsauer H, Unterthiner T, Nessler B, Hochreiter S (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30

  19. Hinz T, Heinrich S, Wermter S (2019) Semantic object accuracy for generative text-to-image synthesis. arXiv:1910.13321

  20. Hong S, Yang D, Choi J, Lee H (2018) Inferring semantic layout for hierarchical text-to-image synthesis. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 7986–7994

  21. Huang H, Yu PS, Wang C (2018) An introduction to image synthesis with generative adversarial nets. arXiv:1803.04469

  22. Kiros R, Salakhutdinov R, Zemel RS (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539

  23. Kiros R, Salakhutdinov R, Zemel R (2014) Multimodal neural language models. In: International Conference on Machine Learning, pp 595–603. PMLR

  24. Kumar M, Aggarwal J, Rani A, Stephan T, Shankar A, Mirjalili S (2021) Secure video communication using firefly optimization and visual cryptography. Artificial Intelligence Review, pp 1–21

  25. Lee S, Tariq S, Shin Y, Woo SS (2021) Detecting handcrafted facial image manipulations and gan-generated facial images using shallow-fakefacenet. Appl Soft Comput 105:107256

  26. Li B, Qi X, Lukasiewicz T, Torr P (2019) Controllable text-to-image generation. Advances in Neural Information Processing Systems 32

  27. Li W, Zhang P, Zhang L, Huang Q, He X, Lyu S, Gao J (2019) Object-driven text-to-image synthesis via adversarial training. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 12174–12182

  28. Mansimov E, Parisotto E, Ba JL, Salakhutdinov R (2015) Generating images from captions with attention. arXiv:1511.02793

  29. Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv:1411.1784

  30. Mishra P, Rathore TS, Shivani S, Tendulkar S (2020) Text to image synthesis using residual gan. In: 2020 3rd International conference on emerging technologies in computer engineering: Machine learning and internet of things (ICETCE), pp. 139–144. IEEE

  31. Nguyen A, Clune J, Bengio Y, Dosovitskiy A, Yosinski J (2017) Plug & play generative networks: Conditional iterative generation of images in latent space. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 4467–4477

  32. Nguyen A, Clune J, Bengio Y, Dosovitskiy A, Yosinski J (2017) Plug & play generative networks: Conditional iterative generation of images in latent space. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 4467–4477

  33. Odena A, Olah C, Shlens J (2017) Conditional image synthesis with auxiliary classifier gans. In: International conference on machine learning, pp 2642–2651. PMLR

  34. Pan Z, Yu W, Yi X, Khan A, Yuan F, Zheng Y (2019) Recent progress on generative adversarial networks (gans): a survey. IEEE Access 7:36322–36333

  35. Peng D, Yang W, Liu C, Lü S (2021) Sam-gan: Self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Netw 138:57–67

  36. Qiao T, Zhang J, Xu D, Tao D (2019) Mirrorgan: Learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 1505–1514

  37. Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. arXiv:2102.12092

  38. Reed S, Akata Z, Lee H, Schiele B (2016) Learning deep representations of fine-grained visual descriptions. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 49–58

  39. Reed SE, Akata Z, Mohan S, Tenka S, Schiele B, Lee H (2016) Learning what and where to draw. Adv Neural Inf Process Syst 29:217–225

  40. Reed S, Akata Z, Yan X, Logeswaran L, Schiele B, Lee H (2016) Generative adversarial text to image synthesis. In: International conference on machine learning, pp 1060–1069. PMLR

  41. Sah S, Peri D, Shringi A, Zhang C, Dominguez M, Savakis A, Ptucha R (2018) Semantically invariant text-to-image generation. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp 3783–3787. IEEE

  42. Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training gans. Adv Neural Inf Process Syst 29:2234–2242

  43. Sun Q, Chang K-H, Dormer KJ, Dyer Jr RK, Gan RZ (2002) An advanced computer-aided geometric modeling and fabrication method for human middle ear. Med Eng Phys 24(9):595–606

  44. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 2818–2826

  45. Tao M, Tang H, Wu S, Sebe N, Jing X-Y, Wu F, Bao B (2020) Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv:2008.05865

  46. Valle R (2019) Hands-on generative adversarial networks with keras: Your guide to implementing next-generation generative adversarial networks. Packt Publishing Ltd

  47. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 3156–3164

  48. Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The caltech-ucsd birds-200-2011 dataset

  49. Xia W, Yang Y, Xue J-H, Wu B (2021) Tedigan: Text-guided diverse face image generation and manipulation. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 2256–2265

  50. Xu K, Ba J, Kiros R, Cho K, Courville A, Salakhudinov R, Zemel R, Bengio Y (2015) Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning, pp 2048–2057. PMLR

  51. Xu T, Zhang P, Huang Q, Zhang H, Gan Z, Huang X, He X (2018) Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp 1316–1324

  52. Ye H, Yang X, Takac M, Sunderraman R, Ji S (2021) Improving text-to-image synthesis using contrastive learning. arXiv:2107.02423

  53. Yuan M, Peng Y (2018) Text-to-image synthesis via symmetrical distillation networks, pp 1407–1415

  54. Zakraoui J, Saleh M, Al-Maadeed S, Jaam JM (2021) Improving text-to-image generation with object layout guidance. Multimedia Tools and Applications, pp 1–21

  55. Zhang Y, Han S, Zhang Z, Wang J, Bi H (2022) Cf-gan: cross-domain feature fusion generative adversarial network for text-to-image synthesis. The Visual Computer, pp 1–11

  56. Zhang H, Koh JY, Baldridge J, Lee H, Yang Y (2021) Cross-modal contrastive learning for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 833–842

  57. Zhang C, Peng Y (2018) Stacking vae and gan for context-aware text-to-image generation. In: 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), pp 1–5. IEEE

  58. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2017) Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International conference on computer vision, pp 5907–5915

  59. Zhang H, Xu T, Li H, Zhang S, Wang X, Huang X, Metaxas DN (2018) Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Trans Pattern Anal Mach Intell 41(8):1947–1962

  60. Zhou P, Yu N, Wu Z, Davis LS, Shrivastava A, Lim S-N (2021) Deep video inpainting detection. arXiv:2101.11080

  61. Zhu M, Pan P, Chen W, Yang Y (2019) Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pp 5802–5810

Funding

The authors did not receive support from any organization for the submitted work.

Author information

Contributions

Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Writing - original draft preparation, Writing - review and editing: Varsha Singh; Validation: Varsha Singh and Uma Shanker Tiwary; Funding acquisition, Project administration, Resources and Supervision: Uma Shanker Tiwary.

Corresponding author

Correspondence to Varsha Singh.

Ethics declarations

Ethics approval

We confirm that the manuscript has been read and approved by both named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by both of us.

Conflict of interest/Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Singh, V., Tiwary, U.S. Visual content generation from textual description using improved adversarial network. Multimed Tools Appl 82, 10943–10960 (2023). https://doi.org/10.1007/s11042-022-13720-3
