research-article

Collocated Clothing Synthesis with GANs Aided by Textual Information: A Multi-Modal Framework

Published: 18 September 2023

Abstract

Synthesizing realistic images of fashion items that are compatible with given clothing images, while conditioning on multiple modalities, enables novel and exciting applications with enormous economic potential. In this work, we propose a multi-modal collocation framework based on a generative adversarial network (GAN) for synthesizing compatible clothing images. Given an input clothing item consisting of an image and a text description, our model synthesizes a clothing image that is compatible with the input item and is guided by a given text description from the target domain. Specifically, a generator synthesizes realistic, collocated clothing images from image- and text-based latent representations learned from the source domain, while an auxiliary text representation from the target domain supervises the generation results. In addition, a multi-discriminator framework determines compatibility between the generated and input clothing images, as well as visual-semantic matching between the generated clothing images and the target textual information. Extensive quantitative and qualitative results demonstrate that our model substantially outperforms state-of-the-art methods in terms of authenticity, diversity, and visual-semantic similarity between image and text.
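The data flow the abstract describes (a generator fusing image- and text-based latents, checked by one discriminator for item compatibility and another for visual-semantic matching) can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: every function, dimensionality, and loss form below is a hypothetical stand-in (a real system would use convolutional encoders/decoders and learned discriminators), shown only to make the multi-discriminator conditioning concrete.

```python
import math
import random

random.seed(42)
DIM = 8  # toy latent dimensionality (hypothetical)

def text_encoder(tokens):
    """Toy stand-in for a text encoder: bucket tokens into a unit vector."""
    v = [0.0] * DIM
    for t in tokens:
        v[sum(ord(c) for c in t) % DIM] += 1.0
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def image_encoder(pixels):
    """Toy stand-in for an image encoder: average-pool into a latent."""
    step = max(1, len(pixels) // DIM)
    return [sum(pixels[i:i + step]) / step for i in range(0, step * DIM, step)]

def generator(img_latent, txt_latent):
    """Fuse the source-image latent with the target-text latent and 'decode'.
    A real generator would be a learned conv decoder; here we mix and perturb."""
    fused = [0.5 * a + 0.5 * b for a, b in zip(img_latent, txt_latent)]
    return [x + random.gauss(0.0, 0.05) for x in fused]

def compatibility_discriminator(src_latent, gen_latent):
    """Scores whether the generated item collocates with the source item
    (toy: inner product squashed to (0, 1) with a sigmoid)."""
    dot = sum(a * b for a, b in zip(src_latent, gen_latent))
    return 1.0 / (1.0 + math.exp(-dot))

def matching_discriminator(gen_latent, txt_latent):
    """Scores visual-semantic matching between the output and target text."""
    dot = sum(a * b for a, b in zip(gen_latent, txt_latent))
    return 1.0 / (1.0 + math.exp(-dot))

# One forward pass: source image + target-domain description -> collocated item.
src = image_encoder([random.random() for _ in range(64)])
txt = text_encoder("blue denim skirt with buttons".split())
fake = generator(src, txt)

d_compat = compatibility_discriminator(src, fake)
d_match = matching_discriminator(fake, txt)

# The generator objective combines both adversarial signals
# (non-saturating GAN form, one term per discriminator).
g_loss = -math.log(d_compat + 1e-8) - math.log(d_match + 1e-8)
```

The point of the sketch is the wiring: the generator sees both modalities of the source item, while each discriminator supervises a different property of the output, so the two losses jointly push the result toward compatibility and toward the target text.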




Published in

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 1
January 2024, 639 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3613542
Editor: Abdulmotaleb El Saddik

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 September 2023
      • Online AM: 14 August 2023
      • Accepted: 30 July 2023
      • Revised: 25 July 2023
      • Received: 12 September 2022
Published in TOMM Volume 20, Issue 1

