Skip to main content

FashionGPT: A Large Vision-Language Model for Enhancing Fashion Understanding

  • Conference paper
  • First Online:
Artificial Neural Networks and Machine Learning – ICANN 2024 (ICANN 2024)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15020))

Included in the following conference series:

  • 729 Accesses

Abstract

Fashion understanding is a challenging multi-modal task of interpreting multi aspects of fashion images. While traditional computer vision or multi-modal algorithms fall short in providing a comprehensive understanding, Large Vision-Language Model (LVLM) offers a new approach. However, directly using LVLMs presents four major limitations, highlighting the need for a fashion-specific LVLM. Existing fashion datasets also reveal limitations in providing a coherent natural input that fits the LVLMs. To address this bottleneck, we introduce the FUND dataset featuring meticulously annotated textual descriptions for fashion images. Specifically, we build a fashion knowledge base and collect fashion images in various categories online. By leveraging image segmentation model and GPT4, we refine the pre-annotations through manual modifications. Through instruct-tuning with FUND, we develop FashionGPT, a GPT-assisted LVLM based on a solid architecture with exceptional performance on fashion understanding. It is capable of generating coherent and multi-aspect descriptions for fashion images and greatly alleviates the four limitations. Extensive experiments quantitatively and qualitatively demonstrate the effectiveness of FashionGPT and the benefits of FUND, and showcase the broad applications in more tasks.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Bai, J., et al.: Qwen-VL: a frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)

  2. Brown, T., et al.: Language models are few-shot learners. Adv. Neural. Inf. Process. Syst. 33, 1877–1901 (2020)

    Google Scholar 

  3. Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12 m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR (2021)

    Google Scholar 

  4. Chen, J., Guo, H., Yi, K., Li, B., Elhoseiny, M.: VisualGPT: data-efficient adaptation of pretrained language models for image captioning. In: CVPR, pp. 18030–18040 (2022)

    Google Scholar 

  5. Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)

  6. Dai, W., et al.: InstructBLIP: towards general-purpose vision-language models with instruction tuning (2023)

    Google Scholar 

  7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  8. Ge, Y., Zhang, R., Wang, X., Tang, X., Luo, P.: DeepFashion2: a versatile benchmark for detection, pose estimation, segmentation and re-identification of clothing images. In: CVPR, pp. 5337–5345 (2019)

    Google Scholar 

  9. Gu, X., Gao, F., Tan, M., Peng, P.: Fashion analysis and understanding with artificial intelligence. Inform. Process. Manage. 57(5), 102276 (2020)

    Article  Google Scholar 

  10. Hadi Kiapour, M., Han, X., Lazebnik, S., Berg, A.C., Berg, T.L.: Where to buy it: matching street clothing photos in online shops. In: ICCV, pp. 3343–3351 (2015)

    Google Scholar 

  11. Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPscore: a reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 (2021)

  12. Hoffmann, J., et al.: Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022)

  13. Huang, J., Feris, R.S., Chen, Q., Yan, S.: Cross-domain image retrieval with a dual attribute-aware ranking network. In: ICCV, pp. 1062–1070 (2015)

    Google Scholar 

  14. Kirillov, A., et al.: Segment anything. arXiv:2304.02643 (2023)

  15. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)

  16. Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)

  17. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)

  18. Liu, Y., et al.: Multilingual denoising pre-training for neural machine translation (2020)

    Google Scholar 

  19. Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In: CVPR, pp. 1096–1104 (2016)

    Google Scholar 

  20. Ma, Y., Jia, J., Zhou, S., Fu, J., Liu, Y., Tong, Z.: Towards better understanding the clothing fashion styles: a multimodal deep learning approach. In: AAAI (2017)

    Google Scholar 

  21. OpenAI: Introducing ChatGPT (2022). https://openai.com/blog/chatgpt

  22. OpenAI: Gpt-4 technical report (2023)

    Google Scholar 

  23. Ordonez, V., Kulkarni, G., Berg, T.: Im2Text: describing images using 1 million captioned photographs. In: NeurIPS (2011)

    Google Scholar 

  24. Peng, Z., et al.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

  25. Rostamzadeh, N., et al.: Fashion-Gen: the generative fashion dataset and challenge. arXiv preprint arXiv:1806.08317 (2018)

  26. Schuhmann, C., Köpf, A., Vencu, R., Coombes, T., Beaumont, R.: LAION Coco: 600 m synthetic captions from laion2B-en. https://laion.ai/blog/laion-coco/ (2022)

  27. Shankar, S., Garg, V.K., Cipolla, R.: Deep-carving: Discovering visual attributes by carving deep neural nets. In: CVPR, pp. 3403–3412 (2015)

    Google Scholar 

  28. Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: improved training techniques for clip at scale (2023)

    Google Scholar 

  29. Swain, D., Pandya, K., Sanghvi, J., Manchala, Y.: An intelligent fashion object classification using CNN. EAI Endorsed Trans. Ind. Netw. Intell. Syst. 10(4), e2 (2023)

    Article  Google Scholar 

  30. Touvron, H., et al.: Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  31. Touvron, H., et al.: Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  32. Zhao, B., Feng, J., Wu, X., Yan, S.: Memory-augmented attribute manipulation networks for interactive fashion search. In: CVPR, pp. 1520–1528 (2017)

    Google Scholar 

  33. Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)

  34. Zou, X., Kong, X., Wong, W., Wang, C., Liu, Y., Cao, Y.: FashionAI: a hierarchical dataset for fashion understanding. In: CVPR Workshops (2019)

    Google Scholar 

Download references

Acknowledgements

This research work has been funded by National Key R&D Program of China (Grant No. 2023YFC3303800) and Joint Funds of the National Natural Science Foundation of China (Grant No. U21B2020).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Gongshen Liu .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Song, D., Gao, D., Liu, G., Li, X. (2024). FashionGPT: A Large Vision-Language Model for Enhancing Fashion Understanding. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15020. Springer, Cham. https://doi.org/10.1007/978-3-031-72344-5_21

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72344-5_21

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72343-8

  • Online ISBN: 978-3-031-72344-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics