Abstract
Generating a realistic image that matches a given text description is a challenging task. The multi-stage framework, widely adopted for text-to-image synthesis, obtains a high-resolution image by first constructing a low-resolution one. However, the subsequent stages of existing generators have to construct the whole image repeatedly, even though the primitive features of the objects have already been sketched out in the previously adjacent stage. To make the subsequent stages focus on enriching fine-grained details and to improve the quality of the final generated image, this paper proposes an efficient multi-path structure for the multi-stage framework. The proposed structure contains two parts: a staged connection and a multi-scale module. The staged connection transfers the feature maps of the generated image from the previously adjacent stage to the end of the current stage. This path avoids the need for long-term memory and guides the network to focus on modifying and supplementing the details of the generated image. In addition, the multi-scale module extracts features at different scales to generate images with more fine-grained details. The proposed multi-path structure can be introduced into multi-stage algorithms such as StackGAN-v2 and AttnGAN. Extensive experiments are conducted on two widely used datasets, Oxford-102 and CUB, for the text-to-image synthesis task. The results demonstrate the superior performance of the methods with the multi-path structure over their base models.
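To make the two components concrete, the following PyTorch sketch shows one plausible reading of the abstract: a multi-scale module that extracts features with several receptive fields, and a staged connection that adds the previous stage's feature map back at the end of the current stage so that the stage only refines details. All class names and design choices here (kernel sizes, nearest-neighbour upsampling, additive skip) are illustrative assumptions, not the implementation published in the paper.

```python
import torch
import torch.nn as nn


class MultiScaleModule(nn.Module):
    """Extract features at several scales in parallel and fuse them.

    Illustrative only: the kernel sizes (3/5/7) and the 1x1 fusion
    convolution are assumptions, not the paper's exact design.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.branch3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)
        self.branch7 = nn.Conv2d(channels, channels, kernel_size=7, padding=3)
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel branches see different receptive fields (scales).
        feats = torch.cat([self.branch3(x), self.branch5(x), self.branch7(x)], dim=1)
        return self.act(self.fuse(feats))


class StagedGeneratorBlock(nn.Module):
    """One refinement stage with a staged connection.

    The upsampled feature map from the previously adjacent stage is
    added back at the end of the current stage, so the stage only has
    to learn the fine-detail residual instead of redrawing the image.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.refine = MultiScaleModule(channels)

    def forward(self, prev_feat: torch.Tensor) -> torch.Tensor:
        skip = self.upsample(prev_feat)  # staged connection path
        return skip + self.refine(skip)  # current stage adds details only


if __name__ == "__main__":
    stage = StagedGeneratorBlock(channels=64)
    prev = torch.randn(1, 64, 32, 32)  # feature map from the previous stage
    out = stage(prev)
    print(out.shape)  # torch.Size([1, 64, 64, 64])
```

The additive skip mirrors a residual connection across stages: because the previous stage's output is passed through unchanged, the current stage can concentrate on detail residuals rather than reconstructing the whole image.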
References
Yuan, M., Peng, Y.: CKD: cross-task knowledge distillation for text-to-image synthesis. IEEE Trans. Multimedia 22(8), 1955–1968 (2019)
Li, R., Wang, N., Feng, F., Zhang, G., Wang, X.: Exploring global and local linguistic representations for text-to-image synthesis. IEEE Trans. Multimedia 22(12), 3075–3087 (2020)
Zhou, R., Jiang, C., Xu, Q.: A survey on generative adversarial network-based text-to-image synthesis. Neurocomputing 451, 316–336 (2021). https://doi.org/10.1016/j.neucom.2021.04.069
Frolov, S., Hinz, T., Raue, F., Hees, J., Dengel, A.: Adversarial text-to-image synthesis: a review. Neural Netw. 144, 187–209 (2021). https://doi.org/10.1016/j.neunet.2021.07.019
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: International Conference on Machine Learning, pp. 1060–1069. PMLR (2016)
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: StackGAN++: realistic image synthesis with stacked generative adversarial networks. IEEE Trans. Pattern Anal. Mach. Intell. 41(8), 1947–1962 (2018)
Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., He, X.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)
Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Ghahramani Z., Welling M., Cortes C., Lawrence N., Weinberger K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014)
Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (2018). https://openreview.net/forum?id=Hk99zCeAb
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv:1411.1784 (2014)
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., Metaxas, D.N.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915 (2017)
Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)
Cheng, J., Wu, F., Tian, Y., Wang, L., Tao, D.: RiFeGAN: rich feature generation for text-to-image synthesis from prior knowledge. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10911–10920 (2020)
Cheng, J., Wu, F., Tian, Y., Wang, L., Tao, D.: RiFeGAN2: rich feature generation for text-to-image synthesis from constrained prior knowledge. IEEE Trans. Circuits Syst. Video Technol. (2021)
Hinz, T., Heinrich, S., Wermter, S.: Semantic object accuracy for generative text-to-image synthesis. IEEE Trans. Pattern Anal. Mach. Intell. 44(3), 1552–1565 (2022). https://doi.org/10.1109/TPAMI.2020.3021209
Yuan, M., Peng, Y.: Bridge-GAN: interpretable representation learning for text-to-image synthesis. IEEE Trans. Circuits Syst. Video Technol. 30(11), 4258–4268 (2019)
Feng, F., Niu, T., Li, R., Wang, X.: Modality disentangled discriminator for text-to-image synthesis. IEEE Trans. Multimedia (2021)
Peng, D., Yang, W., Liu, C., Lü, S.: SAM-GAN: self-attention supporting multi-stage generative adversarial networks for text-to-image synthesis. Neural Netw. 138, 57–67 (2021). https://doi.org/10.1016/j.neunet.2021.01.023
Yuan, M., Peng, Y.: Text-to-image synthesis via symmetrical distillation networks. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1407–1415 (2018)
Ma, S., Fu, J., Chen, C.W., Mei, T.: DA-GAN: instance-level image translation by deep attention generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5657–5666 (2018)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Qin, J., Liu, F., Liu, K., Jeon, G., Yang, X.: Lightweight hierarchical residual feature fusion network for single-image super-resolution. Neurocomputing 478, 104–123 (2022). https://doi.org/10.1016/j.neucom.2021.12.090
Ding, J., Guo, H., Zhou, H., Yu, J., He, X., Jiang, B.: Distributed feedback network for single-image deraining. Inf. Sci. 572, 611–626 (2021)
Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654 (2016)
Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 26(7), 3142–3155 (2017)
Du, Y., Li, X.: Recursive deep residual learning for single image dehazing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 730–737 (2018)
Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. In: International Conference on Learning Representations (2019). https://openreview.net/forum?id=B1xsqj09Fm
Umer, R.M., Foresti, G.L., Micheloni, C.: Deep generative adversarial residual convolutional networks for real-world super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020)
Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Liang, J., Xu, F., Yu, S.: A multi-scale semantic attention representation for multi-label image recognition with graph networks. Neurocomputing 491, 14–23 (2022). https://doi.org/10.1016/j.neucom.2022.03.057
Lv, X., Wang, C., Fan, X., Leng, Q., Jiang, X.: A novel image super-resolution algorithm based on multi-scale dense recursive fusion network. Neurocomputing 489, 98–111 (2022). https://doi.org/10.1016/j.neucom.2022.02.042
Li, W., Li, J., Li, J., Huang, Z., Zhou, D.: A lightweight multi-scale channel attention network for image super-resolution. Neurocomputing 456, 327–337 (2021). https://doi.org/10.1016/j.neucom.2021.05.090
Li, J., Fang, F., Mei, K., Zhang, G.: Multi-scale residual network for image super-resolution. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 517–532 (2018)
Dong, X., Wang, L., Sun, X., Jia, X., Gao, L., Zhang, B.: Remote sensing image super-resolution using second-order multi-scale networks. IEEE Trans. Geosci. Remote Sens. 59(4), 3473–3485 (2020)
Wang, Q., Gao, Q., Wu, L., Sun, G., Jiao, L.: Adversarial multi-path residual network for image super-resolution. IEEE Trans. Image Process. 30, 6648–6658 (2021)
Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 45(11), 2673–2681 (1997)
Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q.: ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset (2011)
Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE (2008)
Ruan, S., Zhang, Y., Zhang, K., Fan, Y., Tang, F., Liu, Q., Chen, E.: DAE-GAN: dynamic aspect-aware GAN for text-to-image synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13960–13969 (2021)
Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Lee D., Sugiyama M., Luxburg U., Guyon I., Garnett R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 2234–2242. Curran Associates, Inc. (2016)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Guyon I., Von Luxburg U., Bengio S., Wallach H., Fergus R., Vishwanathan S., Garnett R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 6629–6640. Curran Associates, Inc. (2017)
Acknowledgements
The authors acknowledge the financial support from the Fundamental Research Funds for the Provincial Universities of Zhejiang (Grant No. GK219909299001-015), the Natural Science Foundation of China (Grant No. 62206082), the National Undergraduate Training Program for Innovation and Entrepreneurship (Grant No. 202110336042), the Planted Talent Plan (Grant No. 2022R407A002), and the research project on higher education teaching reform (Grant No. YBJG202233).
Author information
Contributions
Jiajun Ding and Huanlei Guo contributed to the study conception and design. Beili Liu, Huanlei Guo and Ming Shen conducted the experiments. Material preparation and data collection were performed by Beili Liu, Huanlei Guo, Ming Shen and Kenong Shen. Jiajun Ding, Beili Liu and Huanlei Guo performed the data analyses and wrote the first draft of the manuscript. Jiajun Ding and Jun Yu were responsible for writing (review and editing) and supervision. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ding, J., Liu, B., Yu, J. et al. An efficient multi-path structure with staged connection and multi-scale mechanism for text-to-image synthesis. Multimedia Systems 29, 1391–1403 (2023). https://doi.org/10.1007/s00530-023-01067-0