
An efficient multi-path structure with staged connection and multi-scale mechanism for text-to-image synthesis

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Generating a realistic image that matches a given text description is a challenging task. The multi-stage framework, widely adopted for text-to-image synthesis, obtains a high-resolution image by first constructing a low-resolution one. However, the subsequent stages of existing generators have to reconstruct the whole image repeatedly, even though the primitive features of the objects were already sketched out in the previous stage. To let the subsequent stages focus on enriching fine-grained details and thereby improve the quality of the final image, this paper proposes an efficient multi-path structure for the multi-stage framework. The proposed structure contains two parts: a staged connection and a multi-scale module. The staged connection transfers the feature maps of the generated image from the previous stage to the end of the current stage; this path avoids the need for long-term memory and guides the network to focus on modifying and supplementing the details of the generated image. In addition, the multi-scale module extracts features at different scales to generate images with more fine-grained details. The proposed multi-path structure can be introduced into multi-stage algorithms such as StackGAN-v2 and AttnGAN. Extensive experiments on two widely used text-to-image synthesis datasets, Oxford-102 and CUB, demonstrate the superior performance of the models equipped with the multi-path structure over their base models.
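To make the two components concrete, the following is a minimal PyTorch sketch (assuming a StackGAN-style refinement stage) of how a staged connection and a multi-scale module could be wired together. All module names, channel sizes, kernel choices and the additive fusion are illustrative assumptions, not the paper's exact architecture.

# Hedged sketch: module names, channel sizes and kernel choices below are
# illustrative assumptions; the paper's exact architecture is not shown here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleModule(nn.Module):
    # Extracts features at several receptive-field sizes in parallel and
    # fuses them with a 1x1 convolution (an assumed instantiation of the
    # paper's multi-scale mechanism).
    def __init__(self, channels: int):
        super().__init__()
        branch = channels // 4
        self.branch1 = nn.Conv2d(channels, branch, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, branch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, branch, kernel_size=5, padding=2)
        self.branch7 = nn.Conv2d(channels, branch, kernel_size=7, padding=3)
        self.fuse = nn.Conv2d(4 * branch, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.branch7(x)],
            dim=1,
        )
        return self.fuse(feats)

class RefinementStage(nn.Module):
    # One refinement stage with a staged connection: the previous stage's
    # feature maps are upsampled and added at the end of the current stage,
    # so the stage only learns residual details instead of redrawing the
    # whole image.
    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            MultiScaleModule(channels),
        )

    def forward(self, current: torch.Tensor, previous: torch.Tensor) -> torch.Tensor:
        prev_up = F.interpolate(previous, size=current.shape[-2:], mode="nearest")
        return self.refine(current) + prev_up

# Toy usage: a 64x64 stage refining on top of a 32x32 previous stage.
stage = RefinementStage(channels=64)
prev_feats = torch.randn(1, 64, 32, 32)
cur_feats = torch.randn(1, 64, 64, 64)
print(stage(cur_feats, prev_feats).shape)  # torch.Size([1, 64, 64, 64])

The additive merge at the end of the stage mirrors the residual-learning intuition described in the abstract: the current stage starts from what the previous stage already drew and only has to supply the missing fine-grained detail.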


References

  1. Yuan, M., Peng, Y.: CKD: cross-task knowledge distillation for text-to-image synthesis. IEEE Trans. Multimedia 22(8), 1955–1968 (2019)

  2. Li, R., Wang, N., Feng, F., Zhang, G., Wang, X.: Exploring global and local linguistic representations for text-to-image synthesis. IEEE Trans. Multimedia 22(12), 3075–3087 (2020)

  3. Zhou, R., Jiang, C., Xu, Q.: A survey on generative adversarial network-based text-to-image synthesis. Neurocomputing 451, 316–336 (2021). https://doi.org/10.1016/j.neucom.2021.04.069

  4. Frolov, S., Hinz, T., Raue, F., Hees, J., Dengel, A.: Adversarial text-to-image synthesis: a review. Neural Netw. 144, 187–209 (2021). https://doi.org/10.1016/j.neunet.2021.07.019

  5. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: International Conference on Machine Learning, pp. 1060–1069. PMLR (2016)

  6. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1947–1962 (2018)

  7. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)

  8. Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)

  9. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014)

  10. Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=Hk99zCeAb

  11. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv:1411.1784 (2014)

  12. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915 (2017)

  13. Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)

  14. Cheng, J., Wu, F., Tian, Y., Wang, L., Tao, D.: RiFeGAN: rich feature generation for text-to-image synthesis from prior knowledge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10911–10920 (2020)

  15. Cheng, J., Wu, F., Tian, Y., Wang, L., Tao, D.: RiFeGAN2: rich feature generation for text-to-image synthesis from constrained prior knowledge. IEEE Trans. Circuits Syst. Video Technol. (2021)

  16. Hinz, T., Heinrich, S., Wermter, S.: Semantic object accuracy for generative text-to-image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 44(3), 1552–1565 (2022). https://doi.org/10.1109/TPAMI.2020.3021209

  17. Yuan, M., Peng, Y.: Bridge-GAN: interpretable representation learning for text-to-image synthesis. IEEE Trans. Circuits Syst. Video Technol. 30(11), 4258–4268 (2019)

  18. Feng, F., Niu, T., Li, R., Wang, X.: Modality disentangled discriminator for text-to-image synthesis. IEEE Trans. Multimedia (2021)

  19. Peng, D., Yang, W., Liu, C., Lü, S.: SAM-GAN: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Netw. 138, 57–67 (2021). https://doi.org/10.1016/j.neunet.2021.01.023

  20. Yuan, M., Peng, Y.: Text-to-image synthesis via symmetrical distillation networks. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1407–1415 (2018)

  21. Ma, S., Fu, J., Chen, C.W., Mei, T.: DA-GAN: instance-level image translation by deep attention generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5657–5666 (2018)

  22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  23. Qin, J., Liu, F., Liu, K., Jeon, G., Yang, X.: Lightweight hierarchical residual feature fusion network for single-image super-resolution. Neurocomputing 478, 104–123 (2022). https://doi.org/10.1016/j.neucom.2021.12.090

  24. Ding, J., Guo, H., Zhou, H., Yu, J., He, X., Jiang, B.: Distributed feedback network for single-image deraining. Inf. Sci. 572, 611–626 (2021)

  25. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654 (2016)

  26. Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)

  27. Du, Y., Li, X.: Recursive deep residual learning for single image dehazing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 730–737 (2018)

  28. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=B1xsqj09Fm

  29. Umer, R.M., Foresti, G.L., Micheloni, C.: Deep generative adversarial residual convolutional networks for real-world super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020)

  30. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  31. Liang, J., Xu, F., Yu, S.: A multi-scale semantic attention representation for multi-label image recognition with graph networks. Neurocomputing 491, 14–23 (2022). https://doi.org/10.1016/j.neucom.2022.03.057

  32. Lv, X., Wang, C., Fan, X., Leng, Q., Jiang, X.: A novel image super-resolution algorithm based on multi-scale dense recursive fusion network. Neurocomputing 489, 98–111 (2022). https://doi.org/10.1016/j.neucom.2022.02.042

  33. Li, W., Li, J., Li, J., Huang, Z., Zhou, D.: A lightweight multi-scale channel attention network for image super-resolution. Neurocomputing 456, 327–337 (2021). https://doi.org/10.1016/j.neucom.2021.05.090

  34. Li, J., Fang, F., Mei, K., Zhang, G.: Multi-scale residual network for image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 517–532 (2018)

  35. Dong, X., Wang, L., Sun, X., Jia, X., Gao, L., Zhang, B.: Remote sensing image super-resolution using second-order multi-scale networks. IEEE Trans. Geosci. Remote Sens. 59(4), 3473–3485 (2020)

  36. Wang, Q., Gao, Q., Wu, L., Sun, G., Jiao, L.: Adversarial multi-path residual network for image super-resolution. IEEE Trans. Image Process. 30, 6648–6658 (2021)

  37. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)

  38. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q.: ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

  39. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset (2011)

  40. Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE (2008)

  41. Ruan, S., Zhang, Y., Zhang, K., Fan, Y., Tang, F., Liu, Q., Chen, E.: DAE-GAN: dynamic aspect-aware GAN for text-to-image synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13960–13969 (2021)

  42. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, vol. 29, pp. 2234–2242. Curran Associates, Inc. (2016)

  43. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

  44. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30, pp. 6629–6640. Curran Associates, Inc. (2017)


Acknowledgements

The authors acknowledge financial support from the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. GK219909299001-015), the Natural Science Foundation of China (Grant No. 62206082), the National Undergraduate Training Program for Innovation and Entrepreneurship (Grant No. 202110336042), the Planted Talent Plan (Grant No. 2022R407A002), and the Research on Higher Teaching Reform project (Grant No. YBJG202233).

Author information


Contributions

Jiajun Ding and Huanlei Guo contributed to the study conception and design. Beili Liu, Huanlei Guo and Ming Shen conducted the experiments. Material preparation and data collection were performed by Beili Liu, Huanlei Guo, Ming Shen and Kenong Shen. Jiajun Ding, Beili Liu and Huanlei Guo performed the data analyses and wrote the first draft of the manuscript. Jiajun Ding and Jun Yu were responsible for writing (review and editing) and supervision. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Huanlei Guo.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ding, J., Liu, B., Yu, J. et al. An efficient multi-path structure with staged connection and multi-scale mechanism for text-to-image synthesis. Multimedia Systems 29, 1391–1403 (2023). https://doi.org/10.1007/s00530-023-01067-0


  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00530-023-01067-0
