ABSTRACT
In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it into a high-resolution one. They typically refine all granularity features output by the previous stage indiscriminately. However, each stage expresses features of different granularities with inconsistent quality, and it is difficult to express precise semantics by further refining poor-quality features generated in the previous stage. Because current methods cannot refine features of different granularities independently, it is challenging to express all semantic factors clearly in the generated image, and some features even deteriorate. To address this issue, we propose Hierarchical Disentangled Representations Generative Adversarial Networks (HDR-GAN), which generates photo-realistic images by explicitly disentangling and individually modeling the semantic factors of the image. HDR-GAN introduces a novel component, a multi-granularity feature disentangled encoder, that represents image information comprehensively by explicitly disentangling multi-granularity features, including pose, shape, and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) module containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied, publicly available datasets (CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.
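The data flow described above — disentangle multi-granularity features, then refine coarse-grained and fine-grained factors independently before recombining them — can be sketched as follows. This is a minimal illustrative sketch of the pipeline's structure only, not the paper's implementation; all function names are hypothetical, and the real model operates on convolutional feature maps within GAN stages rather than flat vectors.

```python
# Structural sketch (assumed names, not the authors' code) of the HDR-GAN
# refinement flow: disentangle -> refine coarse and fine factors separately
# -> recombine. Real stages are convolutional GAN generators.
import numpy as np

def disentangle(features: np.ndarray) -> dict:
    """Multi-granularity feature disentangled encoder (sketch):
    split features into pose, shape, and texture factors."""
    pose, shape, texture = np.split(features, 3)
    return {"pose": pose, "shape": shape, "texture": texture}

def coarse_refine(pose: np.ndarray, shape: np.ndarray):
    """CFR (sketch): refine coarse-grained factors to clarify
    category-level information."""
    return np.tanh(pose), np.tanh(shape)

def fine_refine(texture: np.ndarray) -> np.ndarray:
    """FFR (sketch): refine fine-grained factors to reflect
    instance-level detail."""
    return np.tanh(texture)

def refinement_stage(features: np.ndarray) -> np.ndarray:
    """One MFR stage: each granularity is refined independently,
    so a poor-quality factor no longer drags down the others."""
    f = disentangle(features)
    pose, shape = coarse_refine(f["pose"], f["shape"])
    texture = fine_refine(f["texture"])
    return np.concatenate([pose, shape, texture])

rng = np.random.default_rng(0)
x = rng.standard_normal(12)
y = refinement_stage(x)
print(y.shape)  # (12,)
```

The key point the sketch captures is the independence of the two refinement paths: CFR never sees texture and FFR never sees pose or shape, which is what lets each granularity be improved on its own.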
Index Terms
- Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis