OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

  • Conference paper
  • First Online:
Image Analysis and Processing – ICIAP 2023 (ICIAP 2023)

Abstract

The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging, classification, and multimodal retrieval, prior works either define supervised learning approaches with limited generalization capabilities or rely on more reusable CLIP-based techniques that are, however, trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that adopts only open-source fashion data stemming from diverse domains and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods in terms of both accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.
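
For readers unfamiliar with CLIP-style training, the sketch below illustrates the symmetric image-text contrastive objective that methods of this kind optimize: matched image-caption pairs are pulled together in a shared embedding space while mismatched pairs within the same batch are pushed apart. This is a minimal PyTorch illustration under standard CLIP assumptions (fixed rather than learned temperature, precomputed encoder features), not the authors' released implementation; see the linked repository for the actual code.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_features: torch.Tensor,
                              text_features: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
        # L2-normalize both embedding sets so dot products are cosine similarities.
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)

        # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
        logits = image_features @ text_features.t() / temperature

        # The i-th image in the batch matches the i-th caption, so the
        # correct "class" for each row (and each column) is its own index.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric InfoNCE: average the image-to-text and text-to-image terms.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

Fine-tuning this objective on open-source fashion image-caption pairs is what adapts a general-purpose vision-and-language model to the fashion domain.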

Acknowledgements

This work has been partially supported by the European Commission under the PNRR-M4C2 (PE00000013) project "FAIR - Future Artificial Intelligence Research" and the European Horizon 2020 Programme (grant number 101004545 - ReInHerit), and by the PRIN project "CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content" (CUP B87G22000460001), co-funded by the Italian Ministry of University.

Author information

Corresponding author

Correspondence to Marcella Cornia.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Cartella, G., Baldrati, A., Morelli, D., Cornia, M., Bertini, M., Cucchiara, R. (2023). OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14233. Springer, Cham. https://doi.org/10.1007/978-3-031-43148-7_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-43148-7_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43147-0

  • Online ISBN: 978-3-031-43148-7

  • eBook Packages: Computer Science, Computer Science (R0)
