Abstract
Language-based image generation is a challenging task. Most current studies employ a conditional generative adversarial network (cGAN) as the model framework and have achieved significant progress. Nonetheless, a close examination of these methods reveals two fundamental issues. First, discrete linguistic conditions make cGAN training extremely difficult and impair the model's generalization performance. Second, the conditional discriminator cannot extract semantically consistent features under the linguistic conditions, which hinders conditional discrimination. To address these issues, we propose a condition optimization and mutual information maximization GAN (COMIM-GAN). Specifically, we design (1) a text condition construction module that builds a compact linguistic condition space, and (2) a mutual information loss between images and linguistic conditions that encourages the discriminator to extract features associated with the linguistic conditions. Extensive experiments on the CUB-200 and MS-COCO datasets demonstrate that our method outperforms existing methods.
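The paper's exact loss formulation is given in the full text; as a rough, non-authoritative sketch of the second idea, the PyTorch snippet below shows one standard way to maximize a lower bound on the mutual information between image features and linguistic conditions, via an InfoNCE-style contrastive objective. The function name infonce_mi_loss, the argument names, and the temperature value are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def infonce_mi_loss(img_feats: torch.Tensor, text_embs: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style lower bound on the mutual information between
    image features and sentence embeddings. Matched (image, text)
    pairs along the batch dimension act as positives; all other
    pairings in the batch act as negatives."""
    img = F.normalize(img_feats, dim=-1)   # (B, D) unit-norm image features
    txt = F.normalize(text_embs, dim=-1)   # (B, D) unit-norm text embeddings
    logits = img @ txt.t() / temperature   # (B, B) pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy over rows and columns; minimizing it
    # tightens the InfoNCE bound on I(image; linguistic condition).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In a cGAN of this kind, img_feats would plausibly be pooled intermediate features from the discriminator and text_embs the sentence embeddings serving as the linguistic condition, with the loss added to the discriminator's objective; the actual COMIM-GAN formulation may differ.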