Abstract
The main objective of text-to-image (Txt2Img) synthesis is to generate realistic images from text descriptions. We propose to insert a gated cross word-visual attention unit (GCAU) into the conventional multi-stage generative adversarial network (GAN) Txt2Img framework. Our GCAU consists of two key components. First, a cross word-visual attention mechanism is proposed to draw fine-grained details in different subregions of the image by focusing on the relevant words (via visual-to-word attention), and to select important words by attending to the relevant synthesized subregions of the image (via word-to-visual attention). Second, a gated refinement mechanism is proposed to dynamically select important word information for refining the generated image. Extensive experiments demonstrate the superior image generation performance of the proposed approach on the CUB and MS-COCO benchmark datasets.
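To make the described unit concrete, the following PyTorch sketch shows one plausible way a gated cross word-visual attention unit could be structured: word-to-visual attention gates (selects) the important words, visual-to-word attention lets each image subregion gather detail from those selected words, and a sigmoid gate mixes the resulting word context back into the visual features. The class name, projection layers, tensor shapes, and gating details are illustrative assumptions and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedCrossWordVisualAttention(nn.Module):
    """Illustrative sketch (not the paper's code) of a gated cross
    word-visual attention unit.

    words:   (B, T, D) word features from a text encoder
    visuals: (B, N, D) features of N image subregions
    """

    def __init__(self, dim: int):
        super().__init__()
        self.word_proj = nn.Linear(dim, dim)
        self.vis_proj = nn.Linear(dim, dim)
        self.word_gate = nn.Linear(2 * dim, 1)    # per-word scalar gate
        self.vis_gate = nn.Linear(2 * dim, dim)   # per-region channel gate

    def forward(self, words: torch.Tensor, visuals: torch.Tensor) -> torch.Tensor:
        w = self.word_proj(words)      # (B, T, D)
        v = self.vis_proj(visuals)     # (B, N, D)

        # Word-to-visual attention: each word attends to image subregions;
        # the visual context it gathers is used to gate (select) that word.
        attn_w2v = F.softmax(w @ v.transpose(1, 2), dim=-1)          # (B, T, N)
        vis_context = attn_w2v @ visuals                             # (B, T, D)
        g_word = torch.sigmoid(
            self.word_gate(torch.cat([words, vis_context], dim=-1))) # (B, T, 1)
        selected_words = g_word * words                              # (B, T, D)

        # Visual-to-word attention: each subregion draws fine-grained detail
        # from the selected words.
        attn_v2w = F.softmax(
            v @ self.word_proj(selected_words).transpose(1, 2), dim=-1)  # (B, N, T)
        word_context = attn_v2w @ selected_words                         # (B, N, D)

        # Gated refinement: per subregion, decide how much word information
        # to inject when refining the visual features.
        g_vis = torch.sigmoid(
            self.vis_gate(torch.cat([visuals, word_context], dim=-1)))   # (B, N, D)
        return g_vis * word_context + (1.0 - g_vis) * visuals


# Example: 18 caption words and an 8x8 grid of image subregions.
gcau = GatedCrossWordVisualAttention(dim=256)
out = gcau(torch.randn(2, 18, 256), torch.randn(2, 64, 256))  # (2, 64, 256)
```

In a multi-stage GAN framework, a unit of this kind would sit between stages, refining the visual feature map passed to the next-stage generator.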
Acknowledgement
The work described in this paper is supported by China GDSF No. 2019A1515011949.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lai, B., Ma, L., Tian, J. (2023). Gated Cross Word-Visual Attention-Driven Generative Adversarial Networks for Text-to-Image Synthesis. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13847. Springer, Cham. https://doi.org/10.1007/978-3-031-26293-7_6
DOI: https://doi.org/10.1007/978-3-031-26293-7_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26292-0
Online ISBN: 978-3-031-26293-7
eBook Packages: Computer Science, Computer Science (R0)