Abstract
Although generative adversarial networks (GANs) are widely used for text-to-image generation and have made great progress, several problems remain. The convolution operation in these GAN-based methods acts only on local regions and cannot relate disjoint regions of an image, which leads to structural anomalies in the generated images. Moreover, the semantic consistency between generated images and their corresponding text descriptions still needs improvement. In this paper, we propose a multi-attention generative adversarial network (MAGAN) for text-to-image generation. We use a self-attention mechanism to improve the overall quality of images, so that target images with a definite structure are also generated well, and a multi-head attention mechanism to improve the semantic consistency between generated images and their text descriptions. We conducted extensive experiments on three datasets: the Oxford-102 Flowers dataset, the Caltech-UCSD Birds dataset, and the COCO dataset. MAGAN achieves better results than representative methods such as AttnGAN, MirrorGAN, and ControlGAN.
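To make the two mechanisms concrete, the sketch below shows a SAGAN-style self-attention layer over convolutional feature maps (in the spirit of Zhang et al. [24]) and a cross-modal multi-head attention step between image-region features and word embeddings (in the spirit of Vaswani et al. [19]). This is a minimal PyTorch illustration, not the MAGAN implementation: the layer sizes, tensor shapes, and the way these modules would be wired into the generator are assumptions.

```python
# Minimal sketch of the two attention mechanisms named in the abstract.
# NOT the authors' released code: all sizes and the wiring into the
# generator are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """SAGAN-style self-attention over feature maps.

    Every spatial position attends to every other position, modeling
    long-range structure beyond the local receptive field of convolutions.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, h*w, c//8)
        k = self.key(x).flatten(2)                    # (b, c//8, h*w)
        v = self.value(x).flatten(2)                  # (b, c, h*w)
        attn = F.softmax(torch.bmm(q, k), dim=-1)     # (b, h*w, h*w)
        out = torch.bmm(v, attn.transpose(1, 2))      # (b, c, h*w)
        return self.gamma * out.view(b, c, h, w) + x  # residual connection

# Self-attention inserted at an intermediate generator resolution (assumed).
sa = SelfAttention2d(256)
feats = sa(torch.randn(4, 256, 16, 16))

# Multi-head attention between image regions (queries) and word embeddings
# (keys/values): each head can align sub-regions with different words,
# one plausible way to tighten image-text semantic consistency.
regions = torch.randn(4, 17 * 17, 256)  # (batch, num_regions, dim), assumed sizes
words = torch.randn(4, 18, 256)         # (batch, num_words, dim), assumed sizes
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
attended, weights = mha(query=regions, key=words, value=words)
print(attended.shape)  # torch.Size([4, 289, 256])
```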
References
Goodfellow, I., Pouget-Abadie, J., et al.: Generative adversarial nets. In: NIPS, pp. 2672–2680 (2014)
Reed, S., Akata, Z., et al.: Generative adversarial text to image synthesis. In: ICML, pp. 1060–1069 (2016)
Cho, K., Gulcehre, C., Schwenk, H., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: EMNLP (2014)
Reed, S., Akata, Z., et al.: Learning what and where to draw. In: NIPS, pp. 217–225 (2016)
Zhang, H., Xu, T., Li, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: ICCV, pp. 5907–5915 (2017)
Zhang, H., Xu, T., Li, H., et al.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. TPAMI, pp. 1947–1962 (2018)
Xu, T., Zhang, P., Huang, Q., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: CVPR, pp. 1316–1324 (2018)
Qiao, T., Zhang, J., Xu, D., et al.: MirrorGAN: learning text-to-image generation by redescription. In: CVPR, pp. 1505–1514 (2019)
Li, B., Qi, X., et al.: Controllable text-to-image generation. In: NIPS, pp. 2065–2075 (2019)
Qiao, T., Zhang, J., et al.: Learn, imagine and create: text-to-image generation from prior knowledge. In: NIPS, pp. 885–895 (2019)
Zhang, Z., Xie, Y., Yang, L.: Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In: CVPR, pp. 6199–6208 (2018)
Zhu, M., Pan, P., Chen, W., et al.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: CVPR, pp. 5802–5810 (2019)
Vaswani, A., Shazeer, N., et al.: Attention is all you need. In: NIPS, pp. 5998–6008 (2017)
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. In: ICLR (2016)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)
Xu, K., Ba, J., Kiros, R., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML (2015)
Zhang, H., Goodfellow, I., Metaxas, D., et al.: Self-attention generative adversarial networks. In: ICML, pp. 7354–7363 (2019)
Russakovsky, O., Deng, J., Su, H., et al.: ImageNet large scale visual recognition challenge. IJCV, pp. 211–252 (2015)
Cao, Y., Xu, J., Lin, S., et al.: GCNet: non-local networks meet squeeze-excitation networks and beyond. In: ICCV (2019)
Wang, X., Girshick, R., Gupta, A., et al.: Non-local neural networks. In: CVPR (2018)
Wah, C., Branson, S., Welinder, P., et al.: The Caltech-UCSD Birds-200-2011 Dataset. California Institute of Technology, Technical Report CNS-TR-2011-001 (2011)
Nilsback, M., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP, pp. 722–729 (2008)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Salimans, T., Goodfellow, I., Zaremba, W., et al.: Improved techniques for training GANs. In: NIPS, pp. 2226–2234 (2016)
Heusel, M., Ramsauer, H., et al.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: NIPS, pp. 6626–6637 (2017)
Szegedy, C., Ioffe, S., Shlens, J., et al.: Rethinking the inception architecture for computer vision. In: CVPR, pp. 2818–2826 (2016)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Acknowledgement
This work was supported by the Beijing Natural Science Foundation under Grant No. 4202004.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Jia, X., Mi, Q., Dai, Q. (2021). MAGAN: Multi-attention Generative Adversarial Networks for Text-to-Image Generation. In: Ma, H., et al. (eds.) Pattern Recognition and Computer Vision. PRCV 2021. Lecture Notes in Computer Science, vol. 13022. Springer, Cham. https://doi.org/10.1007/978-3-030-88013-2_26
DOI: https://doi.org/10.1007/978-3-030-88013-2_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88012-5
Online ISBN: 978-3-030-88013-2