OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

  • Conference paper
  • First Online:
Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging, classification, and multimodal retrieval, prior works either define supervised learning approaches with limited generalization capabilities or rely on more reusable CLIP-based techniques that are, however, trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that adopts only open-source fashion data stemming from diverse domains and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods in terms of both accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.
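
For readers unfamiliar with CLIP-style training, the sketch below illustrates the symmetric image-text contrastive objective that methods of this kind optimize: matched image-caption pairs are pulled together in a shared embedding space while mismatched pairs within the same batch are pushed apart. This is a minimal PyTorch illustration under standard CLIP assumptions (fixed rather than learned temperature, precomputed encoder features), not the authors' released implementation; see the linked repository for the actual code.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features: torch.Tensor,
                              text_features: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        # L2-normalize both embedding sets so dot products are cosine similarities.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
        logits = image_features @ text_features.t() / temperature

        # The i-th image in the batch matches the i-th caption, so the
        # correct "class" for each row (and each column) is its own index.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric InfoNCE: average the image-to-text and text-to-image terms.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

Fine-tuning this objective on open-source fashion image-caption pairs is what adapts a general-purpose vision-and-language model to the fashion domain.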

Acknowledgements

This work has been partially supported by the European Commission under the PNRR-M4C2 (PE00000013) project "FAIR - Future Artificial Intelligence Research" and the European Horizon 2020 Programme (grant number 101004545 - ReInHerit), and by the PRIN project "CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content" (CUP B87G22000460001), co-funded by the Italian Ministry of University.

Author information

Corresponding author

Correspondence to Marcella Cornia.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Cartella, G., Baldrati, A., Morelli, D., Cornia, M., Bertini, M., Cucchiara, R. (2023). OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14233. Springer, Cham. https://doi.org/10.1007/978-3-031-43148-7_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-43148-7_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43147-0

  • Online ISBN: 978-3-031-43148-7

  • eBook Packages: Computer Science, Computer Science (R0)
