research-article

Collocated Clothing Synthesis with GANs Aided by Textual Information: A Multi-Modal Framework

Authors:

Zhao Zhang

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 1

Article No.: 26, Pages 1 - 25

https://doi.org/10.1145/3614097

Published: 18 September 2023

Abstract

Synthesizing realistic images of fashion items that are compatible with given clothing images, while conditioning on multiple modalities, enables novel applications with enormous economic potential. In this work, we propose a multi-modal collocation framework based on a generative adversarial network (GAN) for synthesizing compatible clothing images. Given an input clothing item consisting of an image and a text description, our model synthesizes a clothing image that is compatible with the input clothing and is guided by a given text description from the target domain. Specifically, a generator synthesizes realistic and collocated clothing images from image- and text-based latent representations learned from the source domain. An auxiliary text representation from the target domain is added to supervise the generation results. In addition, a multi-discriminator framework assesses the compatibility between the generated and input clothing images, as well as the visual-semantic matching between the generated clothing images and the target textual information. Extensive quantitative and qualitative results demonstrate that our model substantially outperforms state-of-the-art methods in terms of authenticity, diversity, and visual-semantic similarity between image and text.
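
As a rough illustration of the multi-discriminator idea described in the abstract, the sketch below shows one way a generator loss might combine feedback from several discriminators. It is not the formulation used in the article: the non-saturating `-log D` term, the separate realism score, the function names, and the weights `w_compat` and `w_match` are all assumptions made for this example.

```python
import math


def adversarial_term(prob, eps=1e-12):
    """Non-saturating generator loss -log(D(x)) for one discriminator output.

    `prob` is the probability the discriminator assigns to "accept".
    ASSUMPTION: the abstract does not state which adversarial loss is used.
    """
    return -math.log(max(prob, eps))


def generator_objective(p_real, p_compat, p_match, w_compat=1.0, w_match=1.0):
    """Weighted sum of per-discriminator losses (hypothetical weighting).

    p_real   -- score for "the generated image looks realistic" (assumed term)
    p_compat -- score for "generated item is compatible with the input item"
    p_match  -- score for "generated item matches the target text description"
    """
    return (
        adversarial_term(p_real)
        + w_compat * adversarial_term(p_compat)
        + w_match * adversarial_term(p_match)
    )


# A sample that satisfies every discriminator yields a lower generator loss
# than one that is rejected by the compatibility and text-matching checks.
loss_accepted = generator_objective(0.9, 0.8, 0.85)
loss_rejected = generator_objective(0.9, 0.1, 0.2)
print(loss_accepted < loss_rejected)
```

Under this assumed weighting, raising `w_match` would push the generator to follow the target text more closely, at the cost of the other two terms.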

Supplementary Material

3614097.supp.pdf (supplementary material, 1.26 MB)

Cited By

Rong Liu, Annie Anak Joseph, Miaomiao Xin, Hongyan Zang, Wanzhen Wang, and Shengqun Zhang. 2024. Personalized Clothing Prediction Algorithm Based on Multi-modal Feature Fusion. International Journal of Engineering and Technology Innovation 14, 2 (2024), 216–230. Online publication date: 27-Mar-2024.
https://doi.org/10.46604/ijeti.2024.13394
T. Marwala. 2024. Synthetic data. In Mechanism Design, Behavioral Science and Artificial Intelligence in International Relations. 169–180. Online publication date: 2024.
https://doi.org/10.1016/B978-0-443-23982-3.00012-9

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision representations
        Image representations

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications Volume 20, Issue 1

January 2024

639 pages

EISSN:1551-6865

DOI:10.1145/3613542

Editor:
Abdulmotaleb El Saddik
Mohamed Bin Zayed University of Artificial Intelligence, UAE and University of Ottawa, Canada

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 September 2023

Online AM: 14 August 2023

Accepted: 30 July 2023

Revised: 25 July 2023

Received: 12 September 2022

Published in TOMM Volume 20, Issue 1

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Guangdong Basic and Applied Basic Research Foundation
Shenzhen Science and Technology Program
HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication

Bibliometrics

Total Citations: 2
Total Downloads: 395
Downloads (Last 12 months): 162
Downloads (Last 6 weeks): 7

Reflects downloads up to 05 Mar 2025
