
Collocated Clothing Synthesis with GANs Aided by Textual Information: A Multi-Modal Framework

Published: 18 September 2023

Abstract

Synthesizing realistic images of fashion items that are compatible with given clothing images, conditioned on multiple modalities, enables novel and exciting applications with enormous economic potential. In this work, we propose a multi-modal collocation framework based on generative adversarial networks (GANs) for synthesizing compatible clothing images. Given an input clothing item consisting of an image and a text description, our model synthesizes a clothing image that is compatible with the input clothing and guided by a given text description from the target domain. Specifically, a generator synthesizes realistic, collocated clothing images from image- and text-based latent representations learned from the source domain, while an auxiliary text representation from the target domain supervises the generation results. In addition, a multi-discriminator framework determines both the compatibility between the generated and input clothing images and the visual-semantic matching between the generated clothing images and the target textual information. Extensive quantitative and qualitative results demonstrate that our model substantially outperforms state-of-the-art methods in terms of authenticity, diversity, and visual-semantic similarity between image and text.
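The data flow described in the abstract can be illustrated with a minimal sketch: a generator conditioned on image- and text-based latent representations plus a noise vector (the source of diversity), and two discriminators, one scoring compatibility between the input and generated items and one scoring visual-semantic matching against the target text. Note this is an illustrative toy, not the paper's architecture: all networks are untrained single dense layers, and the latent dimensions and function names are assumptions.

```python
# Toy sketch of the multi-modal generator / multi-discriminator data flow.
# All weights are random and untrained; dimensions are assumptions.
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TXT, D_Z = 64, 32, 16   # assumed latent sizes for image, text, noise

def dense(w, x):
    """One dense layer with tanh, standing in for a real deep network."""
    return np.tanh(w @ x)

# Generator: (source image latent, text latent, noise) -> generated image latent
W_g = rng.normal(size=(D_IMG, D_IMG + D_TXT + D_Z))
# Compatibility discriminator: (input image, generated image) -> score
W_dc = rng.normal(size=(1, 2 * D_IMG))
# Visual-semantic matching discriminator: (generated image, target text) -> score
W_dm = rng.normal(size=(1, D_IMG + D_TXT))

def synthesize(img_latent, txt_latent):
    z = rng.normal(size=D_Z)               # fresh noise gives output diversity
    return dense(W_g, np.concatenate([img_latent, txt_latent, z]))

def discriminate(src_img, gen_img, tgt_txt):
    compat = dense(W_dc, np.concatenate([src_img, gen_img]))[0]
    match = dense(W_dm, np.concatenate([gen_img, tgt_txt]))[0]
    return compat, match

src_img = rng.normal(size=D_IMG)   # latent of the given clothing image
tgt_txt = rng.normal(size=D_TXT)   # latent of the target text description
gen_img = synthesize(src_img, tgt_txt)
compat, match = discriminate(src_img, gen_img, tgt_txt)
print(gen_img.shape, compat, match)
```

In training, the two discriminator scores would feed separate adversarial losses, so the generator is pushed simultaneously toward collocation compatibility and toward agreement with the target text.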

Supplementary Material

3614097.supp (3614097.supp.pdf)
Supplementary material


Cited By

  • (2024) Personalized Clothing Prediction Algorithm Based on Multi-modal Feature Fusion. International Journal of Engineering and Technology Innovation 14, 2, 216–230. DOI: 10.46604/ijeti.2024.13394. Online publication date: 27-Mar-2024.
  • (2024) Synthetic data. In Mechanism Design, Behavioral Science and Artificial Intelligence in International Relations, 169–180. DOI: 10.1016/B978-0-443-23982-3.00012-9. Online publication date: 2024.


    Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 1
    January 2024
    639 pages
    EISSN:1551-6865
    DOI:10.1145/3613542
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 18 September 2023
    Online AM: 14 August 2023
    Accepted: 30 July 2023
    Revised: 25 July 2023
    Received: 12 September 2022
    Published in TOMM Volume 20, Issue 1

    Author Tags

    1. Multi-modal
    2. clothes collocation
    3. generative adversarial networks
    4. image translation
    5. fashion data

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Guangdong Basic and Applied Basic Research Foundation
    • Shenzhen Science and Technology Program
    • HITSZ-J&A Joint Laboratory of Digital Design and Intelligent Fabrication

