Abstract
Diffusion models have revolutionized generative modeling across numerous image synthesis tasks. Nevertheless, it is non-trivial to directly apply diffusion models to synthesize an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty stems from the fact that the diffusion process must not only produce a holistically high-fidelity, photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we present a new diffusion model, GarDiff, which triggers a garment-focused diffusion process with amplified guidance from both the basic visual appearance and the detailed textures (i.e., high-frequency details) of the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and the human pose. We further design an appearance loss over the synthesized garment to enhance crucial high-frequency details. Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of GarDiff over state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.
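The two mechanisms named above lend themselves to a compact illustration. Below is a minimal PyTorch sketch of (a) a residual garment cross-attention adapter and (b) a Sobel-based high-frequency appearance loss over the garment region. It is written from the abstract's description only, so every class, function, and parameter name here (GarmentCrossAttention, high_frequency_appearance_loss, garment_tokens, garment_mask, and so on) is an illustrative assumption rather than the released GarDiff implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GarmentCrossAttention(nn.Module):
    """Illustrative adapter block: UNet hidden states attend to garment
    appearance tokens (e.g., CLIP and VAE encodings of the reference
    garment, optionally concatenated with pose features)."""

    def __init__(self, hidden_dim, garment_dim, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, num_heads=n_heads,
            kdim=garment_dim, vdim=garment_dim, batch_first=True)

    def forward(self, hidden_states, garment_tokens):
        # Residual formulation leaves the pre-trained UNet pathway intact
        # when the adapter contribution is small.
        attn_out, _ = self.attn(self.norm(hidden_states),
                                garment_tokens, garment_tokens)
        return hidden_states + attn_out


def high_frequency_appearance_loss(pred, target, garment_mask):
    """L1 distance between Sobel gradients of the synthesized and
    ground-truth images, restricted to the garment region, so the loss
    emphasizes high-frequency texture detail."""
    c = pred.shape[1]
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    kernel = torch.stack([sobel_x, sobel_x.t()]).unsqueeze(1)  # (2, 1, 3, 3)
    kernel = kernel.repeat(c, 1, 1, 1).to(pred)  # depthwise: 2 filters/channel

    def grads(img):
        return F.conv2d(img * garment_mask, kernel, padding=1, groups=c)

    return F.l1_loss(grads(pred), grads(target))


if __name__ == "__main__":
    adapter = GarmentCrossAttention(hidden_dim=320, garment_dim=768)
    h = torch.randn(2, 64, 320)   # (batch, spatial tokens, UNet dim)
    g = torch.randn(2, 77, 768)   # (batch, garment tokens, encoder dim)
    print(adapter(h, g).shape)    # torch.Size([2, 64, 320])

    pred, gt = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
    mask = torch.ones(2, 1, 64, 64)
    print(high_frequency_appearance_loss(pred, gt, mask))
```

In the paper the adapter operates on latent-space features inside the denoising UNet and the loss is computed on the synthesized garment; the sketch only fixes the two ideas in runnable form.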
S. Wan: This work was performed at HiDream.ai.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wan, S. et al. (2025). Improving Virtual Try-On with Garment-Focused Diffusion Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15107. Springer, Cham. https://doi.org/10.1007/978-3-031-72967-6_11