
Fine-Grained Semantic Image Synthesis with Object-Attention Generative Adversarial Network

Published: 21 December 2021

Abstract

Semantic image synthesis is a newly emerging and challenging vision problem that has arisen alongside recent advances in generative adversarial networks. Existing semantic image synthesis methods consider only the global information provided by the semantic segmentation mask, such as class labels, global layout, and location, so the generative models cannot capture the rich, fine-grained local information of the images (e.g., object structure, contour, and texture). To address this issue, we adopt a multi-scale feature fusion algorithm that refines the generated images by learning the fine-grained information of the local objects. We propose OA-GAN, a novel object-attention generative adversarial network that allows attention-driven, multi-fusion refinement for fine-grained semantic image synthesis. Specifically, the proposed model first generates multi-scale global image features and local object features, and then fuses the local object features into the global image features to strengthen the correlation between the local and the global. During feature fusion, the global image features and the local object features are combined through a channel-spatial-wise fusion block that learns ‘what’ and ‘where’ to attend along the channel and spatial axes, respectively. The fused features are used to construct correlation filters that produce feature response maps determining the locations, contours, and textures of the objects. Extensive quantitative and qualitative experiments on the COCO-Stuff, ADE20K, and Cityscapes datasets demonstrate that OA-GAN significantly outperforms state-of-the-art methods.
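The channel-spatial-wise fusion described in the abstract follows the familiar pattern of channel attention (‘what’) followed by spatial attention (‘where’). A minimal NumPy sketch of that pattern is given below; it assumes a simple additive fusion of global and local features and uses sigmoid-gated pooling statistics in place of the learned MLP and convolution of the actual model. All function and variable names here are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_spatial_fuse(global_feat, local_feat):
    """Sketch of a channel-spatial-wise fusion block.

    global_feat, local_feat: arrays of shape (C, H, W).
    Returns the fused features gated by channel and spatial attention.
    """
    # Additive fusion of global and local features (an assumption;
    # the paper's fusion block is learned).
    x = global_feat + local_feat

    # Channel attention ("what" to attend): pool over spatial dims.
    avg_c = x.mean(axis=(1, 2))                 # (C,)
    max_c = x.max(axis=(1, 2))                  # (C,)
    ch_att = sigmoid(avg_c + max_c)[:, None, None]
    x = x * ch_att

    # Spatial attention ("where" to attend): pool over channel dim.
    avg_s = x.mean(axis=0, keepdims=True)       # (1, H, W)
    max_s = x.max(axis=0, keepdims=True)        # (1, H, W)
    sp_att = sigmoid(avg_s + max_s)             # stand-in for a learned conv
    return x * sp_att
```

In the actual model the attention weights would be produced by learned layers; this sketch only shows how the two attention maps sequentially gate the fused features along the channel and spatial axes.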



Published In

ACM Transactions on Intelligent Systems and Technology, Volume 12, Issue 5
October 2021, 383 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3484925
Editor: Huan Liu
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 December 2021
Accepted: 01 June 2021
Revised: 01 June 2021
Received: 01 June 2020
Published in TIST Volume 12, Issue 5


Author Tags

  1. Semantic image synthesis
  2. GAN
  3. attention-driven
  4. channel-spatial-wise fusion

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Beijing Natural Science Foundation

