Abstract
Fashion captioning aims to generate detailed and captivating descriptions from a group of item images. It requires the model to describe attribute details precisely under the supervision of complex sentences. Existing image captioning methods typically focus on describing a single image and often struggle to capture fine-grained visual representations in the fashion domain. Furthermore, the complex description noise and unbalanced word distribution in fashion datasets limit diverse sentence generation. To alleviate redundancy in raw images, we propose an Attribute-based Alignment Module (AAM) that captures content-related information to enhance visual representations. Based on this design, we demonstrate that fashion captioning benefits greatly from grid features with detailed alignment, in contrast to previous successes with dense features. To address the inherent word distribution imbalance, we introduce a more balanced corpus, Fashion-Style-27k, collected from various shopping websites. Additionally, we present a pre-trained Fashion Language Model (FLM) that integrates sentence-level and attribute-level language knowledge into the caption model. Experiments on the FACAD and Fashion-Gen datasets show that the proposed AAM-FLM outperforms existing methods. Descriptions in the two datasets differ considerably in length and style, ranging from 21-word detailed descriptions to 30-word template-based sentences, demonstrating the generalization ability of the proposed model.
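For readers who want a concrete picture of the alignment idea summarized above, the following is a minimal, hypothetical PyTorch sketch of aligning grid features with attribute embeddings via cross-attention. It is not the authors' implementation: the module name, dimensions, detected-attribute inputs, and the cross-attention formulation are all assumptions made for illustration.

```python
# Hypothetical sketch of attribute-based alignment (NOT the published AAM code).
# Assumption: flattened grid features from a visual backbone attend to learned
# attribute embeddings, and the attended result enhances the visual
# representation passed to the caption decoder.
import torch
import torch.nn as nn

class AttributeAlignmentSketch(nn.Module):
    def __init__(self, d_model=512, n_attributes=1000, n_heads=8):
        super().__init__()
        # One learned embedding per fashion attribute (e.g., "v-neck", "floral").
        self.attr_embed = nn.Embedding(n_attributes, d_model)
        # Cross-attention: grid features (queries) attend to attributes (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, grid_feats, attr_ids):
        # grid_feats: (B, H*W, d_model) flattened grid features
        # attr_ids:   (B, K) indices of attributes detected for the item
        attrs = self.attr_embed(attr_ids)                    # (B, K, d_model)
        aligned, _ = self.cross_attn(grid_feats, attrs, attrs)
        # Residual connection preserves the original visual content.
        return self.norm(grid_feats + aligned)

# Usage: the enhanced features would stand in for raw grid features downstream.
feats = torch.randn(2, 49, 512)             # e.g., a 7x7 grid from a backbone
attr_ids = torch.randint(0, 1000, (2, 5))   # five detected attributes per item
enhanced = AttributeAlignmentSketch()(feats, attr_ids)
print(enhanced.shape)                        # torch.Size([2, 49, 512])
```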
Data Availability
All datasets are publicly available (see references). Correspondence and requests for materials should be addressed to Yuhao Tang.
References
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086
Bao C, Zhang X, Chen J, Miao Y (2022) Mmfl-net: multi-scale and multi-granularity feature learning for cross-domain fashion retrieval. Multimed Tools Appl 1–33
Cheng W-H, Song S, Chen C-Y, Hidayati SC, Liu J (2021) Fashion meets computer vision: a survey. ACM Comput Surv (CSUR) 54(4):1–41
Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10578–10587
Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation, pp 376–380
Ding Y, Ma Y, Liao L, Wong WK, Chua T-S (2021) Leveraging multiple relations for fashion trend forecasting based on social media. IEEE Trans Multimed 24:2287–2299
Gu X, Gao F, Tan M, Peng P (2020) Fashion analysis and understanding with artificial intelligence. Inf Process Manag 57(5):102276
Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643
Jain A, Samala PR, Jyothi P, Mittal D, Singh MK (2021) Perturb, predict & paraphrase: Semi-supervised learning using noisy student for image captioning. In: IJCAI, pp 758–764
Jiang S, Li J, Fu Y (2021) Deep learning for fashion style generation. IEEE Trans Neural Netw Learn Syst 33(9):4538–4550
Kang Y, Yu B, Xu Z (2023) A novel approach to multi-attribute predictive analysis based on rough fuzzy sets. Appl Intell 1–18
Kaur N, Pandey S (2023) Predicting clothing attributes with cnn and surf based classification model. Multimed Tools Appl 82(7):10681–10701
Li X, Ye Z, Zhang Z, Zhao M (2021) Clothes image caption generation with attribute detection and visual attention model. Pattern Recognit Lett 141:68–74
Liu A-A, Zhai Y, Xu N, Nie W, Li W, Zhang Y (2021) Region-aware image captioning via interaction learning. IEEE Trans Circuits Syst Video Technol 32(6):3685–3696
Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G (2021) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv 55(9):1–35
Liu Z, Luo P, Qiu S, Wang X, Tang X (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104
Ma Y, Ji J, Sun X, Zhou Y, Ji R (2023) Towards local visual modeling for image captioning. Pattern Recognit 138:109420
Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, Agirre E, Heintz I, Roth D (2021) Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv
Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318
Prudviraj J, Vishnu C, Mohan CK (2022) M-ffn: multi-scale feature fusion network for image captioning. Appl Intell 52(13):14711–14723
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024
Rostamzadeh N, Hosseini S, Boquet T, Stokowiec W, Zhang Y, Jauvin C, Pal C (2018) Fashion-gen: the generative fashion dataset and challenge. arXiv:1806.08317
Shajini M, Ramanan A (2022) A knowledge-sharing semi-supervised approach for fashion clothes classification and attribute prediction. Vis Comput 38(11):3551–3561
Sharma D, Dhiman C, Kumar D (2023) Evolution of visual data captioning methods, datasets, and evaluation metrics: a comprehensive survey. Expert Syst Appl 119773
Stefanini M, Cornia M, Baraldi L, Cascianelli S, Fiameni G, Cucchiara R (2022) From show to tell: a survey on deep learning-based image captioning. IEEE Trans Pattern Anal Mach Intell 45(1):539–559
Vedantam R, Zitnick CL, Parikh D (2015) Cider: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575
Wang C, Gu X (2022) Dynamic-balanced double-attention fusion for image captioning. Eng Appl Artif Intell 114:105194
Wang C, Gu X (2022) Image captioning with adaptive incremental global context attention. Appl Intell 1–23
Wang C, Shen Y, Ji L (2022) Geometry attention transformer with position-aware lstms for image captioning. Expert Syst Appl 201:117174
Wu D, Li Z, Zhou J, Gan J, Gao W, Li H (2022) Clothing attribute recognition via a holistic relation network. Int J Intell Syst 37(9):6201–6220
Wu H, Gao Y, Guo X, Al-Halah Z, Rennie S, Grauman K, Feris R (2021) Fashion iq: A new dataset towards retrieving images by natural language feedback. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11307–11317
Xian T, Li Z, Zhang C, Ma H (2022) Dual global enhanced transformer for image captioning. Neural Netw 148:129–141
Xu P, Zhu X, Clifton DA (2023) Multimodal learning with transformers: a survey. IEEE Trans Pattern Anal Mach Intell
Yang X, Zhang H, Jin D, Liu Y, Wu C-H, Tan J, Xie D, Wang J, Wang X (2020) Fashion captioning: towards generating accurate descriptions with semantic rewards. In: European conference on computer vision, Springer, pp 1–17
Yuan Z, Mou L, Wang Q, Zhu XX (2022) From easy to hard: Learning language-guided curriculum for visual question answering on remote sensing data. IEEE Trans Geosci Remote Sens 60:1–11
Yue X, Zhang C, Fujita H, Lv Y (2021) Clothing fashion style recognition with design issue graph. Appl Intell 51:3548–3560
Zeng F, Zhao M, Zhang Z, Gao S, Cheng L (2022) Joint clothes detection and attribution prediction via anchor-free framework with decoupled representation transformer. In: Proceedings of the 31st ACM international conference on information & knowledge management, pp 2444–2454
Zhang J, Fang Z, Sun H, Wang Z (2022) Adaptive semantic-enhanced transformer for image captioning. IEEE Trans Neural Netw Learn Syst
Zhang J, Fang Z, Wang Z (2022) Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning. Appl Intell 1–17
Zhang X, Sun X, Luo Y, Ji J, Zhou Y, Wu Y, Huang F, Ji R (2021) Rstnet: captioning with adaptive attention on visual and non-visual words. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15465–15474
Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, Zha Z-J (2020) Object relational graph with teacher-recommended learning for video captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13278–13288
Zhou Y, Zhang Y, Hu Z, Wang M (2021) Semi-autoregressive transformer for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3139–3143
Zhou Z, Su Z, Wang R (2022) Attribute-aware heterogeneous graph network for fashion compatibility prediction. Neurocomputing 495:62–74
Zhuge M, Gao D, Fan D-P, Jin L, Chen B, Zhou H, Qiu M, Shao L (2021) Kaleido-bert: vision-language pre-training on fashion domain. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12647–12657
Zohourianshahzadi Z, Kalita JK (2022) Neural attention for image captioning: review of outstanding methods. Artif Intell Rev 55(5):3833–3862
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under grant 62172212 and in part by the Natural Science Foundation of Jiangsu Province under grant BK20230031.
Ethics declarations
Conflicts of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tang, Y., Zhang, L., Yuan, Y. et al. Improving fashion captioning via attribute-based alignment and multi-level language model. Appl Intell 53, 30803–30821 (2023). https://doi.org/10.1007/s10489-023-05167-2