Improving Virtual Try-On with Garment-Focused Diffusion Models

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Diffusion models have revolutionized generative modeling across numerous image synthesis tasks. Nevertheless, it is not trivial to directly apply diffusion models to synthesize an image of a target person wearing a given in-shop garment, i.e., the image-based virtual try-on (VTON) task. The difficulty lies in the fact that the diffusion process must not only produce a holistically high-fidelity, photorealistic image of the target person, but also locally preserve every appearance and texture detail of the given garment. To address this, we propose a new diffusion model, GarDiff, which triggers a garment-focused diffusion process with amplified guidance from both the basic visual appearance and the detailed textures (i.e., high-frequency details) of the given garment. GarDiff first remoulds a pre-trained latent diffusion model with additional appearance priors derived from the CLIP and VAE encodings of the reference garment. Meanwhile, a novel garment-focused adapter is integrated into the UNet of the diffusion model, pursuing local fine-grained alignment with the visual appearance of the reference garment and the human pose. We further design an appearance loss over the synthesized garment to enhance its crucial high-frequency details. Extensive experiments on the VITON-HD and DressCode datasets demonstrate the superiority of GarDiff over state-of-the-art VTON approaches. Code is publicly available at: https://github.com/siqi0905/GarDiff/tree/master.

S. Wan—This work was performed at HiDream.ai
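
To make the abstract's mechanisms concrete, here is a minimal, hypothetical PyTorch sketch of (a) injecting garment appearance priors (e.g., CLIP image tokens and flattened VAE latents of the reference garment) into a UNet block via residual cross-attention, and (b) a high-frequency appearance loss computed on Sobel edge maps restricted to the garment region. The class names, tensor shapes, and Sobel-based formulation are illustrative assumptions, not the authors' implementation; the official code is at the GitHub link above.

```python
# Hypothetical sketch of garment-focused conditioning and a
# high-frequency appearance loss; all names and shapes here are
# assumptions for illustration, not the released GarDiff code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GarmentCrossAttention(nn.Module):
    """Residual cross-attention from UNet feature tokens to garment priors
    (e.g., CLIP image tokens concatenated with flattened VAE latents)."""

    def __init__(self, dim: int, prior_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, kdim=prior_dim,
                                          vdim=prior_dim, batch_first=True)

    def forward(self, x: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) UNet tokens; prior: (B, M, prior_dim) garment tokens.
        out, _ = self.attn(self.norm(x), prior, prior)
        return x + out  # residual injection of garment appearance


def high_frequency_loss(pred: torch.Tensor, target: torch.Tensor,
                        garment_mask: torch.Tensor) -> torch.Tensor:
    """L1 loss between Sobel edge maps of prediction and target, masked to
    the garment region, to emphasise high-frequency texture details.

    pred/target: (B, C, H, W); garment_mask: (B, 1, H, W) in [0, 1].
    """
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.t()
    kernel = torch.stack([kx, ky]).unsqueeze(1).to(pred)   # (2, 1, 3, 3)
    c = pred.shape[1]
    kernel = kernel.repeat(c, 1, 1, 1)                     # depthwise per channel

    def grads(img: torch.Tensor) -> torch.Tensor:
        return F.conv2d(img, kernel, padding=1, groups=c)  # (B, 2C, H, W)

    return F.l1_loss(grads(pred) * garment_mask, grads(target) * garment_mask)
```

In a training loop, such a masked edge loss would typically be added with a small weight to the usual latent-diffusion denoising objective, so the extra supervision sharpens garment texture without disturbing the rest of the image.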

Author information

Corresponding author

Correspondence to Yingwei Pan.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wan, S. et al. (2025). Improving Virtual Try-On with Garment-Focused Diffusion Models. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15107. Springer, Cham. https://doi.org/10.1007/978-3-031-72967-6_11

  • DOI: https://doi.org/10.1007/978-3-031-72967-6_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72966-9

  • Online ISBN: 978-3-031-72967-6

  • eBook Packages: Computer Science, Computer Science (R0)
