Abstract
Although generative adversarial networks (GANs) are widely used for text-to-image generation and have made great progress, several problems remain. The convolution operation in these GAN-based methods acts only on local regions and cannot relate disjoint regions of an image, which leads to structural anomalies in the generated images. Moreover, the semantic consistency between generated images and their corresponding text descriptions still needs improvement. In this paper, we propose a multi-attention generative adversarial network (MAGAN) for text-to-image generation. We use a self-attention mechanism to improve the overall quality of images, so that target images with a definite structure are also generated well, and a multi-head attention mechanism to improve the semantic consistency between generated images and their text descriptions. We conducted extensive experiments on three datasets: the Oxford-102 Flowers dataset, the Caltech-UCSD Birds dataset, and the COCO dataset. MAGAN achieves better results than representative methods such as AttnGAN, MirrorGAN, and ControlGAN.
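To make the two mechanisms concrete, the sketch below shows a SAGAN-style self-attention layer over convolutional feature maps (in the spirit of Zhang et al. [24]) and a cross-modal multi-head attention step between image-region features and word embeddings (in the spirit of Vaswani et al. [19]). This is a minimal PyTorch illustration, not the MAGAN implementation: the layer sizes, tensor shapes, and the way these modules would be wired into the generator are assumptions.

```python
# Minimal sketch of the two attention mechanisms named in the abstract.
# NOT the authors' released code: all sizes and the wiring into the
# generator are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over feature maps.

    Every spatial position attends to every other position, modeling
    long-range structure beyond the local receptive field of convolutions.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, h*w, c//8)
        k = self.key(x).flatten(2)                    # (b, c//8, h*w)
        v = self.value(x).flatten(2)                  # (b, c, h*w)
        attn = F.softmax(torch.bmm(q, k), dim=-1)     # (b, h*w, h*w)
        out = torch.bmm(v, attn.transpose(1, 2))      # (b, c, h*w)
        return self.gamma * out.view(b, c, h, w) + x  # residual connection

# Self-attention inserted at an intermediate generator resolution (assumed).
sa = SelfAttention2d(256)
feats = sa(torch.randn(4, 256, 16, 16))

# Multi-head attention between image regions (queries) and word embeddings
# (keys/values): each head can align sub-regions with different words,
# one plausible way to tighten image-text semantic consistency.
regions = torch.randn(4, 17 * 17, 256)  # (batch, num_regions, dim), assumed sizes
words = torch.randn(4, 18, 256)         # (batch, num_words, dim), assumed sizes
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
attended, weights = mha(query=regions, key=words, value=words)
print(attended.shape)  # torch.Size([4, 289, 256])
```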
References
Goodfellow, I., Pouget-Abadie, J., et al.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)
Reed, S., Akata, Z., et al.: Generative adversarial text to image synthesis. In: ICML, pp. 1060–1069 (2016)
Cho, K., Gulcehre, C., Schwenk, H., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP (2014)
Reed, S., Akata, Z., et al.: Learning what and where to draw. In: NIPS, pp. 217–225 (2016)
Zhang, H., Xu, T., Li, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV, pp. 5907–5915 (2017)
Zhang, H., Xu, T., Li, H., et al.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. TPAMI, pp. 1947–1962 (2018)
Xu, T., Zhang, P., Huang, Q., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: CVPR, pp. 1316–1324 (2018)
Qiao, T., Zhang, J., Xu, D., et al.: MirrorGAN: learning text-to-image generation by redescription. In: CVPR, pp. 1505–1514 (2019)
Li, B., Qi, X., et al.: Controllable text-to-image generation. In: NIPS, pp. 2065–2075 (2019)
Qiao, T., Zhang, J., et al.: Learn, imagine and create: text-to-image generation from prior knowledge. In: NIPS, pp. 885–895 (2019)
Zhang, Z., Xie, Y., Yang, L.: Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: CVPR, pp. 6199–6208 (2018)
Zhu, M., Pan, P., Chen, W., et al.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: CVPR, pp. 5802–5810 (2019)
Vaswani, A., Shazeer, N., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR (2016)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)
Xu, K., Ba, J., Kiros, R., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML (2015)
Zhang, H., Goodfellow, I., Metaxas, D., et al.: Self-attention generative adversarial networks. In: ICML, pp. 7354–7363 (2019)
Russakovsky, O., Deng, J., Su, H., et al.: ImageNet large scale visual recognition challenge. IJCV, pp. 211–252 (2015)
Cao, Y., Xu, J., Lin, S., et al.: GCNet: non-local networks meet squeeze-excitation networks and beyond. In: ICCV (2019)
Wang, X., Girshick, R., Gupta, A., et al.: Non-local neural networks. In: CVPR (2018)
Wah, C., Branson, S., Welinder, P., et al.: The Caltech-UCSD Birds-200-2011 Dataset. California Institute of Technology, Technical Report CNS-TR-2011-001 (2011)
Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP, pp. 722–729 (2008)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Salimans, T., Goodfellow, I., Zaremba, W., et al.: Improved techniques for training GANs. In: NIPS, pp. 2226–2234 (2016)
Heusel, M., Ramsauer, H., et al.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS, pp. 6626–6637 (2017)
Szegedy, C., Ioffe, S., Shlens, J., et al.: Rethinking the inception architecture for computer vision. In: CVPR, pp. 2818–2826 (2016)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Acknowledgement
This work was supported by the Beijing Natural Science Foundation under Grant No. 4202004.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Jia, X., Mi, Q., Dai, Q. (2021). MAGAN: Multi-attention Generative Adversarial Networks for Text-to-Image Generation. In: Ma, H., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science, vol. 13022. Springer, Cham. https://doi.org/10.1007/978-3-030-88013-2_26
DOI: https://doi.org/10.1007/978-3-030-88013-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88012-5
Online ISBN: 978-3-030-88013-2