MUSH: Multi-scale Hierarchical Feature Extraction for Semantic Image Synthesis

Wang, Zicong; Ren, Qiang; Wang, Junli; Yan, Chungang; Jiang, Changjun

doi:10.1007/978-3-031-26293-7_12

Zicong Wang^12,13,
Qiang Ren^12,13,
Junli Wang^12,13,
Chungang Yan^12,13 &
…
Changjun Jiang^12,13

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13847))

Included in the following conference series:

Asian Conference on Computer Vision

369 Accesses

Abstract

Semantic image synthesis aims to translate semantic label masks to photo-realistic images. Previous methods have limitations that extract semantic features with limited convolutional kernels and ignores some crucial information, such as relative positions of pixels. To address these issues, we propose MUSH, a novel semantic image synthesis model that utilizes multi-scale information. In the generative network stage, a multi-scale hierarchical architecture is proposed for feature extraction and merged successfully with guided sampling operation to enhance semantic image synthesis. Meanwhile, in the discriminative network stage, the model contains two different modules for feature extraction of semantic masks and real images, respectively, which helps use semantic masks information more effectively. Furthermore, our proposed model achieves the state-of-the-art qualitative evaluation and quantitative metrics on some challenging datasets. Experimental results show that our method can be generalized to various models for semantic image synthesis. Our code is available at https://github.com/WangZC525/MUSH.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Almahairi, A., Rajeshwar, S., Sordoni, A., Bachman, P., Courville, A.: Augmented cycleGAN: learning many-to-many mappings from unpaired data. In: International Conference on Machine Learning, pp. 195–204. PMLR (2018)
Google Scholar
Caesar, H., Uijlings, J., Ferrari, V.: Coco-stuff: thing and stuff classes in context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1209–1218 (2018)
Google Scholar
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, Atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
Article Google Scholar
Chen, Q., Koltun, V.: Photographic image synthesis with cascaded refinement networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1511–1520 (2017)
Google Scholar
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)
Google Scholar
Dundar, A., Sapra, K., Liu, G., Tao, A., Catanzaro, B.: Panoptic-based image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8070–8079 (2020)
Google Scholar
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Google Scholar
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Google Scholar
Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510 (2017)
Google Scholar
Huang, X., Liu, M.-Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11207, pp. 179–196. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_11
Chapter Google Scholar
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Google Scholar
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)
Google Scholar
Jiang, L., Zhang, C., Huang, M., Liu, C., Shi, J., Loy, C.C.: TSIT: a simple and versatile framework for image-to-image translation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12348, pp. 206–222. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58580-8_13
Chapter Google Scholar
Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 694–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46475-6_43
Chapter Google Scholar
Karacan, L., Akata, Z., Erdem, A., Erdem, E.: Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215 (2016)
Karacan, L., Akata, Z., Erdem, A., Erdem, E.: Manipulating attributes of natural scenes via hallucination. ACM Trans. Graph. (TOG) 39(1), 1–17 (2019)
Article Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015)
Google Scholar
Lim, J.H., Ye, J.C.: Geometric GAN. arXiv preprint arXiv:1705.02894 (2017)
Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Google Scholar
Liu, X., Yin, G., Shao, J., Wang, X., et al.: Learning to predict layout-to-image conditional convolutions for semantic image synthesis. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
Google Scholar
Long, J., Lu, H.: Generative adversarial networks with bi-directional normalization for semantic image synthesis. In: Proceedings of the 2021 International Conference on Multimedia Retrieval, pp. 219–226 (2021)
Google Scholar
Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. In: International Conference on Learning Representations (2018)
Google Scholar
Nakashima, K.: Deeplab-pytorch. https://github.com/kazuto1011/deeplab-pytorch (2018)
Ntavelis, E., Romero, A., Kastanis, I., Van Gool, L., Timofte, R.: SESAME: semantic editing of scenes by adding, manipulating or erasing objects. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12367, pp. 394–411. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58542-6_24
Chapter Google Scholar
Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346 (2019)
Google Scholar
Qi, X., Chen, Q., Jia, J., Koltun, V.: Semi-parametric image synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8808–8816 (2018)
Google Scholar
Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Chapter Google Scholar
Schonfeld, E., Sushko, V., Zhang, D., Gall, J., Schiele, B., Khoreva, A.: You only need adversarial supervision for semantic image synthesis. In: International Conference on Learning Representations (2020)
Google Scholar
Seitzer, M.: Pytorch-fid: FID score for PyTorch, August 2020. https://github.com/mseitzer/pytorch-fid. version 0.2.1
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Tan, Z., et al.: Diverse semantic image synthesis via probability distribution modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7962–7971 (2021)
Google Scholar
Tan, Z., et al.: Efficient semantic image synthesis via class-adaptive normalization. IEEE Trans. Pattern Anal. Mach. Intell. 44, 4852–4866 (2021)
Google Scholar
Tan, Z., et al.: Rethinking spatially-adaptive normalization. arXiv preprint arXiv:2004.02867 (2020)
Tang, H., Bai, S., Sebe, N.: Dual attention GANs for semantic image synthesis. In: Proceedings of the 28th ACM International Conference on Multimedia, pp. 1994–2002 (2020)
Google Scholar
Tang, H., Qi, X., Xu, D., Torr, P.H., Sebe, N.: Edge guided GANs with semantic preserving for semantic image synthesis. arXiv preprint arXiv:2003.13898 (2020)
Tang, H., Xu, D., Yan, Y., Torr, P.H., Sebe, N.: Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7870–7879 (2020)
Google Scholar
Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional GANs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807 (2018)
Google Scholar
Xiao, Tete, Liu, Yingcheng, Zhou, Bolei, Jiang, Yuning, Sun, Jian: Unified perceptual parsing for scene understanding. In: Ferrari, Vittorio, Hebert, Martial, Sminchisescu, Cristian, Weiss, Yair (eds.) ECCV 2018. LNCS, vol. 11209, pp. 432–448. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01228-1_26
Chapter Google Scholar
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations (ICLR) (2016)
Google Scholar
Yu, F., Koltun, V., Funkhouser, T.: Dilated residual networks. In: Computer Vision and Pattern Recognition (CVPR) (2017)
Google Scholar
Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. In: International Conference on Machine Learning, pp. 7354–7363. PMLR (2019)
Google Scholar
Zhao, B., Meng, L., Yin, W., Sigal, L.: Image generation from layout. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8584–8593 (2019)
Google Scholar
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641 (2017)
Google Scholar
Zhou, B., et al.: Semantic understanding of scenes through the ade20k dataset. Int. J. Comput. Vision. 127, 302–321 (2018)
Article Google Scholar
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232 (2017)
Google Scholar
Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Google Scholar
Zhu, P., Abdal, R., Qin, Y., Wonka, P.: Sean: image synthesis with semantic region-adaptive normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5104–5113 (2020)
Google Scholar
Zhu, Z., Xu, Z., You, A., Bai, X.: Semantically multi-modal image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5467–5476 (2020)
Google Scholar

Download references

Acknowledgements

This work was supported by the National Key Research and Development Program of China under Grant 2018YFB2100801.

Author information

Authors and Affiliations

Key Laboratory of Embedded System and Service Computing (Tongji University), Ministry of Education, Shanghai, 201804, China
Zicong Wang, Qiang Ren, Junli Wang, Chungang Yan & Changjun Jiang
National (Province-Ministry Joint) Collaborative Innovation Center for Financial Network Security, Tongji University, Shanghai, 201804, China
Zicong Wang, Qiang Ren, Junli Wang, Chungang Yan & Changjun Jiang

Authors

Zicong Wang
View author publications
You can also search for this author in PubMed Google Scholar
Qiang Ren
View author publications
You can also search for this author in PubMed Google Scholar
Junli Wang
View author publications
You can also search for this author in PubMed Google Scholar
Chungang Yan
View author publications
You can also search for this author in PubMed Google Scholar
Changjun Jiang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Changjun Jiang .

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 36364 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, Z., Ren, Q., Wang, J., Yan, C., Jiang, C. (2023). MUSH: Multi-scale Hierarchical Feature Extraction for Semantic Image Synthesis. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13847. Springer, Cham. https://doi.org/10.1007/978-3-031-26293-7_12

Download citation

DOI: https://doi.org/10.1007/978-3-031-26293-7_12
Published: 11 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26292-0
Online ISBN: 978-3-031-26293-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

MUSH: Multi-scale Hierarchical Feature Extraction for Semantic Image Synthesis