Abstract
The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging and multimodal retrieval, prior works have either proposed supervised learning approaches with limited generalization capability or more reusable CLIP-based techniques that, however, were trained on closed-source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that adopts only open-source fashion data stemming from diverse domains and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods in terms of both accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip.
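As a concrete illustration of how such a model can be used for zero-shot tag classification, the minimal sketch below loads a CLIP-style checkpoint through the OpenCLIP library and scores an image against a set of candidate garment labels. It assumes the released weights are a standard OpenCLIP ViT-B/32 state dict; the checkpoint file name, label set, and prompt template are illustrative assumptions, not details taken from the paper.

```python
# Minimal zero-shot garment-tagging sketch with OpenCLIP (not the authors'
# released code). Checkpoint name, labels, and prompt are assumptions.
import torch
import open_clip
from PIL import Image

# Build a ViT-B/32 CLIP model and its matching preprocessing pipeline.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Hypothetical fine-tuned weights, e.g. downloaded from the authors' repository.
model.load_state_dict(torch.load("openfashionclip_vitb32.pt", map_location="cpu"))
model.eval()

labels = ["t-shirt", "dress", "sneakers", "handbag"]
text = tokenizer([f"a photo of a {label}" for label in labels])
image = preprocess(Image.open("garment.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot products below are cosine similarities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Temperature-scaled softmax over candidate labels, as in CLIP.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```

The same normalized embeddings support multimodal retrieval: ranking a gallery of garment images against a text query uses the identical encoders, with the softmax replaced by a top-k sort over cosine similarities.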
Acknowledgements
This work has been partially supported by the European Commission under the PNRR-M4C2 (PE00000013) project “FAIR - Future Artificial Intelligence Research” and the European Horizon 2020 Programme (grant number 101004545 - ReInHerit), and by the PRIN project “CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content” (CUP B87G22000460001), co-funded by the Italian Ministry of University.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Cartella, G., Baldrati, A., Morelli, D., Cornia, M., Bertini, M., Cucchiara, R. (2023). OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14233. Springer, Cham. https://doi.org/10.1007/978-3-031-43148-7_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43147-0
Online ISBN: 978-3-031-43148-7