
PBGN: Phased Bidirectional Generation Network in Text-to-Image Synthesis

Neural Processing Letters

Abstract

Text-to-image synthesis methods are mainly evaluated on two aspects: the quality and diversity of the generated images, and the semantic consistency between the generated images and the input sentences. To address the problem of semantic consistency during image generation, we propose the Phased Bidirectional Generation Network (PBGN). Building on a multi-level generative adversarial network, PBGN uses a bidirectional generation mechanism: the image generated at each level is used to regenerate text, and a reconstruction loss is introduced to constrain the generated images. We also examine the effectiveness of the self-attention mechanism and spectral normalization for improving the performance of generative networks. Furthermore, we propose an efficient boundary augmentation strategy to improve the performance of the model on small-scale datasets. Our method achieves Inception Scores of 4.71, 5.13, and 32.42, and R-precision scores of 92.55, 87.72, and 92.29 on the Oxford-102, CUB-200, and MS-COCO datasets, respectively.
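To make these components concrete, the sketch below illustrates, under stated assumptions, the three ingredients the abstract names: a spectral-normalized discriminator block, a SAGAN-style self-attention layer, and a generator objective that adds a text-reconstruction (redescription) term to the adversarial loss. This is a minimal PyTorch illustration, not the authors' implementation; the module names and the weight `lambda_rec` are hypothetical, and it assumes the reconstruction loss is a token-level cross-entropy between the caption regenerated from the image and the input sentence.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNDiscriminatorBlock(nn.Module):
    """Downsampling conv block whose weight is spectral-normalized to
    stabilize discriminator training (Miyato et al., 2018)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = spectral_norm(
            nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1))
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(self.conv(x))

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over spatial positions; gamma starts
    at zero so the attention branch is blended in gradually."""
    def __init__(self, ch):
        super().__init__()
        self.q = spectral_norm(nn.Conv2d(ch, ch // 8, 1))
        self.k = spectral_norm(nn.Conv2d(ch, ch // 8, 1))
        self.v = spectral_norm(nn.Conv2d(ch, ch, 1))
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.k(x).flatten(2)                   # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)        # (B, HW, HW)
        v = self.v(x).flatten(2)                   # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x

def generator_loss(disc_fake_logits, caption_logits, caption_ids,
                   lambda_rec=1.0):
    """Adversarial loss plus a hypothetical text-reconstruction term:
    caption_logits (B, T, V) come from re-describing the generated
    image; caption_ids (B, T) are the input sentence's token ids."""
    adv = nn.functional.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    rec = nn.functional.cross_entropy(
        caption_logits.view(-1, caption_logits.size(-1)),
        caption_ids.view(-1))
    return adv + lambda_rec * rec
```

In a phased, multi-level network, the adversarial and reconstruction terms would be accumulated over all generation levels; a single level is shown here for brevity.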



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61966004 and 61866004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Innovation Project of Guangxi Graduate Education (No. XYCBZ2021002), the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.

Author information


Corresponding author

Correspondence to Zhixin Li.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhu, J., Li, Z., Wei, J. et al. PBGN: Phased Bidirectional Generation Network in Text-to-Image Synthesis. Neural Process Lett 54, 5371–5391 (2022). https://doi.org/10.1007/s11063-022-10866-x
