Abstract
Synthesizing realistic images of fashion items that are compatible with given clothing images, while conditioning on multiple modalities, enables novel applications with considerable economic potential. In this work, we propose a multi-modal collocation framework based on a generative adversarial network (GAN) for synthesizing compatible clothing images. Given an input clothing item consisting of an image and a text description, our model synthesizes a clothing image that is compatible with the input clothing and is guided by a given text description from the target domain. Specifically, a generator synthesizes realistic and collocated clothing images from image- and text-based latent representations learned from the source domain. An auxiliary text representation from the target domain is added to supervise the generation results. In addition, a multi-discriminator framework assesses the compatibility between the generated and input clothing images, as well as the visual-semantic matching between the generated clothing images and the target textual information. Extensive quantitative and qualitative results demonstrate that our model substantially outperforms state-of-the-art methods in terms of authenticity, diversity, and visual-semantic similarity between image and text.
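As a rough illustration of how a generator can be trained against several discriminators at once, the sketch below combines one adversarial term per discriminator into a single generator loss. It is a minimal sketch under stated assumptions, not the formulation used in this work: the discriminator names (`compatibility`, `text_matching`), the weights, the scores, and the use of a non-saturating binary cross-entropy term are all hypothetical choices made for illustration.

```python
import math


def bce(prob, target):
    """Binary cross-entropy for a single probability, clamped away from 0 and 1."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def generator_loss(d_scores, weights):
    """Weighted sum of generator losses, one term per discriminator.

    d_scores maps a (hypothetical) discriminator name to the probability it
    assigns to the generated sample being acceptable; the generator is
    rewarded when every discriminator outputs a value close to 1.
    """
    return sum(weights[name] * bce(score, 1.0) for name, score in d_scores.items())


# Hypothetical scores from two discriminators for one generated image.
scores = {"compatibility": 0.5, "text_matching": 0.25}
weights = {"compatibility": 1.0, "text_matching": 1.0}
loss = generator_loss(scores, weights)
```

In this sketch, raising the weight of one term pushes the generator to satisfy that discriminator more strongly, which is one common way to trade off compatibility against text matching.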
Index Terms
- Collocated Clothing Synthesis with GANs Aided by Textual Information: A Multi-Modal Framework