Abstract
Generating a realistic image that matches a given text description is a challenging task. The multi-stage framework, widely adopted for text-to-image synthesis, obtains a high-resolution image by first constructing a low-resolution one. However, the subsequent stages of existing generators have to construct the whole image repeatedly, even though the primitive features of the objects have already been sketched out in the previously adjacent stage. To make the subsequent stages focus on enriching fine-grained details and to improve the quality of the final generated image, this paper proposes an efficient multi-path structure for the multi-stage framework. The proposed structure contains two parts: a staged connection and a multi-scale module. The staged connection transfers the feature maps of the generated image from the previously adjacent stage to the end of the current stage. This path avoids the need for long-term memory and guides the network to focus on modifying and supplementing the details of the generated image. In addition, the multi-scale module extracts features at different scales to generate images with more fine-grained details. The proposed multi-path structure can be introduced into multi-stage algorithms such as StackGAN-v2 and AttnGAN. Extensive experiments are conducted on two widely used datasets, Oxford-102 and CUB, for the text-to-image synthesis task. The results demonstrate the superior performance of the methods with the multi-path structure over their base models.
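To make the two components concrete, the following PyTorch sketch shows one plausible reading of the abstract: a multi-scale module that extracts features with several receptive fields, and a staged connection that adds the previous stage's feature map back at the end of the current stage so that the stage only refines details. All class names and design choices here (kernel sizes, nearest-neighbour upsampling, additive skip) are illustrative assumptions, not the implementation published in the paper.

```python
import torch
import torch.nn as nn


class MultiScaleModule(nn.Module):
    """Extract features at several scales in parallel and fuse them.

    Illustrative only: the kernel sizes (3/5/7) and the 1x1 fusion
    convolution are assumptions, not the paper's exact design.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.branch7 = nn.Conv2d(channels, channels, kernel_size=7, padding=3)
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel branches see different receptive fields (scales).
        feats = torch.cat([self.branch3(x), self.branch5(x), self.branch7(x)], dim=1)
        return self.act(self.fuse(feats))


class StagedGeneratorBlock(nn.Module):
    """One refinement stage with a staged connection.

    The upsampled feature map from the previously adjacent stage is
    added back at the end of the current stage, so the stage only has
    to learn the fine-detail residual instead of redrawing the image.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.refine = MultiScaleModule(channels)

    def forward(self, prev_feat: torch.Tensor) -> torch.Tensor:
        skip = self.upsample(prev_feat)  # staged connection path
        return skip + self.refine(skip)  # current stage adds details only


if __name__ == "__main__":
    stage = StagedGeneratorBlock(channels=64)
    prev = torch.randn(1, 64, 32, 32)  # feature map from the previous stage
    out = stage(prev)
    print(out.shape)  # torch.Size([1, 64, 64, 64])
```

The additive skip mirrors a residual connection across stages: because the previous stage's output is passed through unchanged, the current stage can concentrate on detail residuals rather than reconstructing the whole image.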
References
Yuan, M., Peng, Y.: CKD: cross-task knowledge distillation for text-to-image synthesis. IEEE Trans. Multimedia 22(8), 1955–1968 (2019)
Li, R., Wang, N., Feng, F., Zhang, G., Wang, X.: Exploring global and local linguistic representations for text-to-image synthesis. IEEE Trans. Multimedia 22(12), 3075–3087 (2020)
Zhou, R., Jiang, C., Xu, Q.: A survey on generative adversarial network-based text-to-image synthesis. Neurocomputing 451, 316–336 (2021). https://doi.org/10.1016/j.neucom.2021.04.069
Frolov, S., Hinz, T., Raue, F., Hees, J., Dengel, A.: Adversarial text-to-image synthesis: a review. Neural Netw. 144, 187–209 (2021). https://doi.org/10.1016/j.neunet.2021.07.019
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: International Conference on Machine Learning, pp. 1060–1069. PMLR (2016)
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1947–1962 (2018)
Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)
Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Ghahramani Z., Welling M., Cortes C., Lawrence N., Weinberger K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014)
Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=Hk99zCeAb
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv:1411.1784 (2014)
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915 (2017)
Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)
Cheng, J., Wu, F., Tian, Y., Wang, L., Tao, D.: RiFeGAN: rich feature generation for text-to-image synthesis from prior knowledge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10911–10920 (2020)
Cheng, J., Wu, F., Tian, Y., Wang, L., Tao, D.: RiFeGAN2: rich feature generation for text-to-image synthesis from constrained prior knowledge. IEEE Trans. Circuits Syst. Video Technol. (2021)
Hinz, T., Heinrich, S., Wermter, S.: Semantic object accuracy for generative text-to-image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 44(3), 1552–1565 (2022). https://doi.org/10.1109/TPAMI.2020.3021209
Yuan, M., Peng, Y.: Bridge-GAN: interpretable representation learning for text-to-image synthesis. IEEE Trans. Circuits Syst. Video Technol. 30(11), 4258–4268 (2019)
Feng, F., Niu, T., Li, R., Wang, X.: Modality disentangled discriminator for text-to-image synthesis. IEEE Trans. Multimedia (2021)
Peng, D., Yang, W., Liu, C., Lü, S.: SAM-GAN: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Netw. 138, 57–67 (2021). https://doi.org/10.1016/j.neunet.2021.01.023
Yuan, M., Peng, Y.: Text-to-image synthesis via symmetrical distillation networks. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1407–1415 (2018)
Ma, S., Fu, J., Chen, C.W., Mei, T.: DA-GAN: instance-level image translation by deep attention generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5657–5666 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Qin, J., Liu, F., Liu, K., Jeon, G., Yang, X.: Lightweight hierarchical residual feature fusion network for single-image super-resolution. Neurocomputing 478, 104–123 (2022). https://doi.org/10.1016/j.neucom.2021.12.090
Ding, J., Guo, H., Zhou, H., Yu, J., He, X., Jiang, B.: Distributed feedback network for single-image deraining. Inf. Sci. 572, 611–626 (2021)
Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654 (2016)
Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
Du, Y., Li, X.: Recursive deep residual learning for single image dehazing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 730–737 (2018)
Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=B1xsqj09Fm
Umer, R.M., Foresti, G.L., Micheloni, C.: Deep generative adversarial residual convolutional networks for real-world super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020)
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Liang, J., Xu, F., Yu, S.: A multi-scale semantic attention representation for multi-label image recognition with graph networks. Neurocomputing 491, 14–23 (2022). https://doi.org/10.1016/j.neucom.2022.03.057
Lv, X., Wang, C., Fan, X., Leng, Q., Jiang, X.: A novel image super-resolution algorithm based on multi-scale dense recursive fusion network. Neurocomputing 489, 98–111 (2022). https://doi.org/10.1016/j.neucom.2022.02.042
Li, W., Li, J., Li, J., Huang, Z., Zhou, D.: A lightweight multi-scale channel attention network for image super-resolution. Neurocomputing 456, 327–337 (2021). https://doi.org/10.1016/j.neucom.2021.05.090
Li, J., Fang, F., Mei, K., Zhang, G.: Multi-scale residual network for image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 517–532 (2018)
Dong, X., Wang, L., Sun, X., Jia, X., Gao, L., Zhang, B.: Remote sensing image super-resolution using second-order multi-scale networks. IEEE Trans. Geosci. Remote Sens. 59(4), 3473–3485 (2020)
Wang, Q., Gao, Q., Wu, L., Sun, G., Jiao, L.: Adversarial multi-path residual network for image super-resolution. IEEE Trans. Image Process. 30, 6648–6658 (2021)
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q.: ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset (2011)
Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE (2008)
Ruan, S., Zhang, Y., Zhang, K., Fan, Y., Tang, F., Liu, Q., Chen, E.: DAE-GAN: dynamic aspect-aware GAN for text-to-image synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13960–13969 (2021)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Lee D., Sugiyama M., Luxburg U., Guyon I., Garnett R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 2234–2242. Curran Associates, Inc. (2016)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Guyon I., Von Luxburg U., Bengio S., Wallach H., Fergus R., Vishwanathan S., Garnett R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 6629–6640. Curran Associates, Inc. (2017)
Acknowledgements
The authors acknowledge the financial support from the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. GK219909299001-015), the Natural Science Foundation of China (Grant No. 62206082), the National Undergraduate Training Program for Innovation and Entrepreneurship (Grant No. 202110336042), the Planted Talent Plan (Grant No. 2022R407A002), and the research project on higher education teaching reform (Grant No. YBJG202233).
Author information
Contributions
Jiajun Ding and Huanlei Guo contributed to the study conception and design. Beili Liu, Huanlei Guo and Ming Shen conducted the experiments. Material preparation and data collection were performed by Beili Liu, Huanlei Guo, Ming Shen and Kenong Shen. Jiajun Ding, Beili Liu and Huanlei Guo performed the data analyses and wrote the first draft of the manuscript. Jiajun Ding and Jun Yu were responsible for writing (review and editing) and supervision. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ding, J., Liu, B., Yu, J. et al. An efficient multi-path structure with staged connection and multi-scale mechanism for text-to-image synthesis. Multimedia Systems 29, 1391–1403 (2023). https://doi.org/10.1007/s00530-023-01067-0