Abstract
Text-to-image synthesis methods are mainly evaluated from two aspects: the quality and diversity of the generated images, and the semantic consistency between the generated images and the input sentences. To address the problem of semantic consistency during image generation, we propose a Phased Bidirectional Generation Network (PBGN). We employ a bidirectional generative mechanism based on a multi-level generative adversarial network, in which the image generated at each level is used to generate text, and a reconstruction loss is introduced to constrain the generated images. At the same time, we examine the effectiveness of the self-attention mechanism and spectral normalization for improving the performance of the generative network. Furthermore, we propose an efficient boundary augmentation strategy to improve the performance of the model on small-scale datasets. Our method achieves Inception Scores of 4.71, 5.13, and 32.42, and R-precision scores of 92.55, 87.72, and 92.29, on the Oxford-102, CUB-200, and MS-COCO datasets, respectively.
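As a rough illustration of two of the components named above, the sketch below pairs spectral normalization on a discriminator block with an image-to-text reconstruction term. It is a minimal, hypothetical PyTorch-style example, not the authors' implementation; names such as SNDiscriminatorBlock, reconstruction_loss, and lambda_rec are illustrative assumptions.

```python
# Illustrative sketch only (not the paper's released code): a spectrally
# normalized discriminator block plus a reconstruction loss that compares the
# caption embedding recovered from a generated image with the embedding of the
# input sentence. The weight lambda_rec is an assumed hyperparameter.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm


class SNDiscriminatorBlock(nn.Module):
    """Convolutional block with spectral normalization on every weight layer."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1))
        self.conv2 = spectral_norm(nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.leaky_relu(self.conv1(x), 0.2)
        return F.leaky_relu(self.conv2(x), 0.2)


def reconstruction_loss(recovered_embed: torch.Tensor, caption_embed: torch.Tensor) -> torch.Tensor:
    """Distance between the text embedding re-derived from a generated image
    and the embedding of the input sentence (the image-to-text constraint)."""
    return F.mse_loss(recovered_embed, caption_embed)


if __name__ == "__main__":
    block = SNDiscriminatorBlock(3, 64)
    fake_img = torch.randn(4, 3, 64, 64)        # images from one generation stage
    caption_embed = torch.randn(4, 256)          # embedding of the input sentence
    recovered_embed = torch.randn(4, 256)        # placeholder for the embedding recovered from fake_img

    d_out = block(fake_img).mean(dim=(1, 2, 3))  # toy discriminator score per image
    adv_loss = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    lambda_rec = 1.0                             # assumed weighting, not from the paper
    total_g_loss = adv_loss + lambda_rec * reconstruction_loss(recovered_embed, caption_embed)
    print(float(total_g_loss))
```

In the actual network, the recovered caption embedding would come from an image-to-text branch attached to each generation stage rather than the random placeholder used here.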
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61966004, 61866004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Innovation Project of Guangxi Graduate Education (No. XYCBZ2021002), the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.
Ethics declarations
Conflict of interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "PBGN: Phased Bidirectional Generation Network in Text-to-Image Synthesis".
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhu, J., Li, Z., Wei, J. et al. PBGN: Phased Bidirectional Generation Network in Text-to-Image Synthesis. Neural Process Lett 54, 5371–5391 (2022). https://doi.org/10.1007/s11063-022-10866-x