FashionGPT: A Large Vision-Language Model for Enhancing Fashion Understanding

Song, Duanxiao; Gao, Dehong; Liu, Gongshen; Li, Xiaoyong

doi:10.1007/978-3-031-72344-5_21

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15020)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

Fashion understanding is a challenging multi-modal task that requires interpreting multiple aspects of fashion images. While traditional computer vision and multi-modal algorithms fall short of providing a comprehensive understanding, Large Vision-Language Models (LVLMs) offer a new approach. However, using LVLMs directly presents four major limitations, highlighting the need for a fashion-specific LVLM. Existing fashion datasets are also limited: they do not provide the coherent natural-language input that LVLMs require. To address this bottleneck, we introduce the FUND dataset, which features meticulously annotated textual descriptions for fashion images. Specifically, we build a fashion knowledge base and collect fashion images in various categories online. We then generate pre-annotations with an image segmentation model and GPT-4 and refine them through manual modifications. Through instruction tuning with FUND, we develop FashionGPT, a GPT-assisted LVLM built on a solid architecture that performs strongly on fashion understanding. It can generate coherent, multi-aspect descriptions for fashion images and greatly alleviates the four limitations. Extensive experiments demonstrate, both quantitatively and qualitatively, the effectiveness of FashionGPT and the benefits of FUND, and showcase broad applications to further tasks.

Acknowledgements

This research was funded by the National Key R&D Program of China (Grant No. 2023YFC3303800) and the Joint Funds of the National Natural Science Foundation of China (Grant No. U21B2020).

Author information

Authors and Affiliations

Shanghai Jiaotong University, Shanghai, China
Duanxiao Song, Gongshen Liu & Xiaoyong Li
Northwestern Polytechnical University, Xi’an, China
Dehong Gao

Corresponding author

Correspondence to Gongshen Liu.

Editor information

Editors and Affiliations

IDSIA USI-SUPSI, Lugano, Switzerland
Michael Wand
Comenius University, Bratislava, Slovakia
Kristína Malinovská
KAUST Center of Generative AI, Thuwal, Saudi Arabia
Jürgen Schmidhuber
Helmholtz Zentrum München, Neuherberg, Germany
Igor V. Tetko

About this paper

Cite this paper

Song, D., Gao, D., Liu, G., Li, X. (2024). FashionGPT: A Large Vision-Language Model for Enhancing Fashion Understanding. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15020. Springer, Cham. https://doi.org/10.1007/978-3-031-72344-5_21

DOI: https://doi.org/10.1007/978-3-031-72344-5_21
Published: 17 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72343-8
Online ISBN: 978-3-031-72344-5
eBook Packages: Computer Science, Computer Science (R0)
