Improving fashion captioning via attribute-based alignment and multi-level language model

Abstract

Fashion captioning aims to generate detailed and captivating descriptions based on a group of item images. It requires the model to precisely describe attribute details under the supervision of complex sentences. Existing image captioning methods typically focus on describing a single image and often struggle to capture fine-grained visual representations in the fashion domain. Furthermore, the presence of complex description noise and unbalanced word distribution in fashion datasets limits diverse sentence generation. To alleviate redundancy in raw images, we propose an Attribute-based Alignment Module (AAM), which captures more content-related information to enhance visual representations. Based on this design, we demonstrate that fashion captioning can benefit greatly from grid features with detailed alignment, in contrast to previous success with dense features. To address the inherent word distribution imbalance, we introduce a more balanced corpus called Fashion-Style-27k, collected from various shopping websites. Additionally, we present a pre-trained Fashion Language Model (FLM) that integrates sentence-level and attribute-level language knowledge into the caption model. Experiments on the FACAD and Fashion-Gen datasets show that the proposed AAM-FLM outperforms existing methods. Descriptions in the two datasets differ considerably in length and style, ranging from 21-word detailed descriptions to 30-word template-based sentences, demonstrating the generalization ability of the proposed model.
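To make the alignment idea above concrete, here is a minimal sketch of attribute-to-grid cross-attention: a set of learnable attribute queries attends over flattened grid features, yielding one attribute-aligned visual vector per slot. This is an illustrative PyTorch sketch under stated assumptions, not the paper's actual AAM; the class name AttributeAlignment, the slot count n_attributes, and all shapes are hypothetical.

```python
# Hypothetical sketch of attribute-to-grid alignment (not the paper's code).
import torch
import torch.nn as nn

class AttributeAlignment(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_attributes=20):
        super().__init__()
        # One learnable query per attribute slot (e.g., color, fabric, fit).
        self.attr_embed = nn.Embedding(n_attributes, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, grid_feats):
        # grid_feats: (batch, h*w, d_model) flattened CNN/ViT grid features.
        b = grid_feats.size(0)
        queries = self.attr_embed.weight.unsqueeze(0).expand(b, -1, -1)
        # Each attribute query attends over all grid cells, down-weighting
        # regions unrelated to that attribute.
        aligned, attn = self.cross_attn(queries, grid_feats, grid_feats)
        return self.norm(aligned), attn  # (batch, n_attributes, d_model)

feats = torch.randn(2, 49, 512)               # e.g., a 7x7 grid of 512-d cells
aligned, weights = AttributeAlignment()(feats)
print(aligned.shape)                          # torch.Size([2, 20, 512])
```

In a full captioner, the aligned vectors would replace or augment the raw grid features fed to the language decoder.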

Data Availability

All datasets are publicly available (see references). Correspondence and requests for materials should be addressed to Yuhao Tang.

Notes

  1. https://github.com/tylin/coco-caption (a usage sketch follows these notes)

  2. https://nlp.stanford.edu/software/lex-parser.shtml

  3. https://huggingface.co/transformers
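
As a minimal, hypothetical usage sketch for the coco-caption toolkit in note 1 (available on PyPI as pycocoevalcap): captions are passed as {image_id: [caption, ...]} dicts of pre-tokenized strings. The example captions below are made up.

```python
# Hedged example: scoring captions with pycocoevalcap (the toolkit in note 1).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

gts = {"0": ["a red cotton dress with a pleated skirt"]}  # reference captions
res = {"0": ["a red dress with a pleated skirt"]}         # model output

bleu, _ = Bleu(4).compute_score(gts, res)  # BLEU-1 through BLEU-4
cider, _ = Cider().compute_score(gts, res)
print(bleu, cider)
```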

References

  1. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086

  2. Bao C, Zhang X, Chen J, Miao Y (2022) Mmfl-net: multi-scale and multi-granularity feature learning for cross-domain fashion retrieval. Multimed Tools Appl 1–33

  3. Cheng W-H, Song S, Chen C-Y, Hidayati SC, Liu J (2021) Fashion meets computer vision: a survey. ACM Comput Surv (CSUR) 54(4):1–41

  4. Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10578–10587

  5. Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation, pp 376–380

  6. Ding Y, Ma Y, Liao L, Wong WK, Chua T-S (2021) Leveraging multiple relations for fashion trend forecasting based on social media. IEEE Trans Multimed 24:2287–2299

  7. Gu X, Gao F, Tan M, Peng P (2020) Fashion analysis and understanding with artificial intelligence. Inf Process Manag 57(5):102276

  8. Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643

  9. Jain A, Samala PR, Jyothi P, Mittal D, Singh MK (2021) Perturb, predict & paraphrase: semi-supervised learning using noisy student for image captioning. In: IJCAI, pp 758–764

  10. Jiang S, Li J, Fu Y (2021) Deep learning for fashion style generation. IEEE Trans Neural Netw Learn Syst 33(9):4538–4550

  11. Kang Y, Yu B, Xu Z (2023) A novel approach to multi-attribute predictive analysis based on rough fuzzy sets. Appl Intell 1–18

  12. Kaur N, Pandey S (2023) Predicting clothing attributes with cnn and surf based classification model. Multimed Tools Appl 82(7):10681–10701

  13. Li X, Ye Z, Zhang Z, Zhao M (2021) Clothes image caption generation with attribute detection and visual attention model. Pattern Recognit Lett 141:68–74

  14. Liu A-A, Zhai Y, Xu N, Nie W, Li W, Zhang Y (2021) Region-aware image captioning via interaction learning. IEEE Trans Circ Syst Video Technol 32(6):3685–3696

  15. Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G (2021) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv 55(9):1–35

  16. Liu Z, Luo P, Qiu S, Wang X, Tang X (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104

  17. Ma Y, Ji J, Sun X, Zhou Y, Ji R (2023) Towards local visual modeling for image captioning. Pattern Recognit 138:109420

  18. Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, Agirre E, Heintz I, Roth D (2021) Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv

  19. Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318

  20. Prudviraj J, Vishnu C, Mohan CK (2022) M-ffn: multi-scale feature fusion network for image captioning. Appl Intell 52(13):14711–14723

  21. Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024

  22. Rostamzadeh N, Hosseini S, Boquet T, Stokowiec W, Zhang Y, Jauvin C, Pal C (2018) Fashion-gen: the generative fashion dataset and challenge. arXiv:1806.08317

  23. Shajini M, Ramanan A (2022) A knowledge-sharing semi-supervised approach for fashion clothes classification and attribute prediction. Vis Comput 38(11):3551–3561

  24. Sharma D, Dhiman C, Kumar D (2023) Evolution of visual data captioning methods, datasets, and evaluation metrics: a comprehensive survey. Expert Syst Appl 119773

  25. Stefanini M, Cornia M, Baraldi L, Cascianelli S, Fiameni G, Cucchiara R (2022) From show to tell: a survey on deep learning-based image captioning. IEEE Trans Pattern Anal Mach Intell 45(1):539–559

  26. Vedantam R, Lawrence Zitnick C, Parikh D (2015) Cider: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575

  27. Wang C, Gu X (2022) Dynamic-balanced double-attention fusion for image captioning. Eng Appl Artif Intell 114:105194

  28. Wang C, Gu X (2022) Image captioning with adaptive incremental global context attention. Appl Intell 1–23

  29. Wang C, Shen Y, Ji L (2022) Geometry attention transformer with position-aware lstms for image captioning. Expert Syst Appl 201:117174

  30. Wu D, Li Z, Zhou J, Gan J, Gao W, Li H (2022) Clothing attribute recognition via a holistic relation network. Int J Intell Syst 37(9):6201–6220

  31. Wu H, Gao Y, Guo X, Al-Halah Z, Rennie S, Grauman K, Feris R (2021) Fashion iq: a new dataset towards retrieving images by natural language feedback. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11307–11317

  32. Xian T, Li Z, Zhang C, Ma H (2022) Dual global enhanced transformer for image captioning. Neural Netw 148:129–141

  33. Xu P, Zhu X, Clifton DA (2023) Multimodal learning with transformers: a survey. IEEE Trans Pattern Anal Mach Intell

  34. Yang X, Zhang H, Jin D, Liu Y, Wu C-H, Tan J, Xie D, Wang J, Wang X (2020) Fashion captioning: towards generating accurate descriptions with semantic rewards. In: European conference on computer vision, Springer, pp 1–17

  35. Yuan Z, Mou L, Wang Q, Zhu XX (2022) From easy to hard: learning language-guided curriculum for visual question answering on remote sensing data. IEEE Trans Geosci Remote Sens 60:1–11

  36. Yue X, Zhang C, Fujita H, Lv Y (2021) Clothing fashion style recognition with design issue graph. Appl Intell 51:3548–3560

  37. Zeng F, Zhao M, Zhang Z, Gao S, Cheng L (2022) Joint clothes detection and attribution prediction via anchor-free framework with decoupled representation transformer. In: Proceedings of the 31st ACM international conference on information & knowledge management, pp 2444–2454

  38. Zhang J, Fang Z, Sun H, Wang Z (2022) Adaptive semantic-enhanced transformer for image captioning. IEEE Trans Neural Netw Learn Syst

  39. Zhang J, Fang Z, Wang Z (2022) Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning. Appl Intell 1–17

  40. Zhang X, Sun X, Luo Y, Ji J, Zhou Y, Wu Y, Huang F, Ji R (2021) Rstnet: captioning with adaptive attention on visual and non-visual words. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15465–15474

  41. Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, Zha Z-J (2020) Object relational graph with teacher-recommended learning for video captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13278–13288

  42. Zhou Y, Zhang Y, Hu Z, Wang M (2021) Semi-autoregressive transformer for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3139–3143

  43. Zhou Z, Su Z, Wang R (2022) Attribute-aware heterogeneous graph network for fashion compatibility prediction. Neurocomputing 495:62–74

  44. Zhuge M, Gao D, Fan D-P, Jin L, Chen B, Zhou H, Qiu M, Shao L (2021) Kaleido-bert: vision-language pre-training on fashion domain. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12647–12657

  45. Zohourianshahzadi Z, Kalita JK (2022) Neural attention for image captioning: review of outstanding methods. Artif Intell Rev 55(5):3833–3862

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under grant 62172212 and in part by the Natural Science Foundation of Jiangsu Province under grant BK20230031.

Author information

Corresponding author

Correspondence to Liyan Zhang.

Ethics declarations

Conflicts of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tang, Y., Zhang, L., Yuan, Y. et al. Improving fashion captioning via attribute-based alignment and multi-level language model. Appl Intell 53, 30803–30821 (2023). https://doi.org/10.1007/s10489-023-05167-2
