Abstract
Fashion captioning aims to generate detailed and captivating descriptions from a group of item images. It requires the model to describe attribute details precisely under the supervision of complex sentences. Existing image captioning methods typically focus on describing a single image and often struggle to capture fine-grained visual representations in the fashion domain. Furthermore, the complex description noise and unbalanced word distribution in fashion datasets limit diverse sentence generation. To alleviate redundancy in raw images, we propose an Attribute-based Alignment Module (AAM) that captures content-related information to enhance visual representations. Based on this design, we demonstrate that fashion captioning benefits greatly from grid features with detailed alignment, in contrast to previous successes with dense features. To address the inherent word distribution imbalance, we introduce a more balanced corpus, Fashion-Style-27k, collected from various shopping websites. Additionally, we present a pre-trained Fashion Language Model (FLM) that integrates sentence-level and attribute-level language knowledge into the caption model. Experiments on the FACAD and Fashion-Gen datasets show that the proposed AAM-FLM outperforms existing methods. Descriptions in the two datasets differ considerably in length and style, ranging from 21-word detailed descriptions to 30-word template-based sentences, demonstrating the generalization ability of the proposed model.
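For readers who want a concrete picture of the alignment idea summarized above, the following is a minimal, hypothetical PyTorch sketch of aligning grid features with attribute embeddings via cross-attention. It is not the authors' implementation: the module name, dimensions, detected-attribute inputs, and the cross-attention formulation are all assumptions made for illustration.

```python
# Hypothetical sketch of attribute-based alignment (NOT the published AAM code).
# Assumption: flattened grid features from a visual backbone attend to learned
# attribute embeddings, and the attended result enhances the visual
# representation passed to the caption decoder.
import torch
import torch.nn as nn

class AttributeAlignmentSketch(nn.Module):
    def __init__(self, d_model=512, n_attributes=1000, n_heads=8):
        super().__init__()
        # One learned embedding per fashion attribute (e.g., "v-neck", "floral").
        self.attr_embed = nn.Embedding(n_attributes, d_model)
        # Cross-attention: grid features (queries) attend to attributes (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, grid_feats, attr_ids):
        # grid_feats: (B, H*W, d_model) flattened grid features
        # attr_ids:   (B, K) indices of attributes detected for the item
        attrs = self.attr_embed(attr_ids)                    # (B, K, d_model)
        aligned, _ = self.cross_attn(grid_feats, attrs, attrs)
        # Residual connection preserves the original visual content.
        return self.norm(grid_feats + aligned)

# Usage: the enhanced features would stand in for raw grid features downstream.
feats = torch.randn(2, 49, 512)             # e.g., a 7x7 grid from a backbone
attr_ids = torch.randint(0, 1000, (2, 5))   # five detected attributes per item
enhanced = AttributeAlignmentSketch()(feats, attr_ids)
print(enhanced.shape)                        # torch.Size([2, 49, 512])
```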
Data Availability
All datasets are publicly available (see references). Correspondence and requests for materials should be addressed to Yuhao Tang.
References
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6077–6086
Bao C, Zhang X, Chen J, Miao Y (2022) Mmfl-net: multi-scale and multi-granularity feature learning for cross-domain fashion retrieval. Multimed Tools Appl 1–33
Cheng W-H, Song S, Chen C-Y, Hidayati SC, Liu J (2021) Fashion meets computer vision: a survey. ACM Comput Surv (CSUR) 54(4):1–41
Cornia M, Stefanini M, Baraldi L, Cucchiara R (2020) Meshed-memory transformer for image captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10578–10587
Denkowski M, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the ninth workshop on statistical machine translation, pp 376–380
Ding Y, Ma Y, Liao L, Wong WK, Chua T-S (2021) Leveraging multiple relations for fashion trend forecasting based on social media. IEEE Trans Multimed 24:2287–2299
Gu X, Gao F, Tan M, Peng P (2020) Fashion analysis and understanding with artificial intelligence. Inf Process Manag 57(5):102276
Huang L, Wang W, Chen J, Wei X-Y (2019) Attention on attention for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 4634–4643
Jain A, Samala PR, Jyothi P, Mittal D, Singh MK (2021) Perturb, predict & paraphrase: Semi-supervised learning using noisy student for image captioning. In: IJCAI, pp 758–764
Jiang S, Li J, Fu Y (2021) Deep learning for fashion style generation. IEEE Trans Neural Netw Learn Syst 33(9):4538–4550
Kang Y, Yu B, Xu Z (2023) A novel approach to multi-attribute predictive analysis based on rough fuzzy sets. Appl Intell 1–18
Kaur N, Pandey S (2023) Predicting clothing attributes with cnn and surf based classification model. Multimed Tools Appl 82(7):10681–10701
Li X, Ye Z, Zhang Z, Zhao M (2021) Clothes image caption generation with attribute detection and visual attention model. Pattern Recognit Lett 141:68–74
Liu A-A, Zhai Y, Xu N, Nie W, Li W, Zhang Y (2021) Region-aware image captioning via interaction learning. IEEE Trans Circuits Syst Video Technol 32(6):3685–3696
Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G (2021) Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv 55(9):1–35
Liu Z, Luo P, Qiu S, Wang X, Tang X (2016) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1096–1104
Ma Y, Ji J, Sun X, Zhou Y, Ji R (2023) Towards local visual modeling for image captioning. Pattern Recognit 138:109420
Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, Agirre E, Heintz I, Roth D (2021) Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv
Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318
Prudviraj J, Vishnu C, Mohan CK (2022) M-ffn: multi-scale feature fusion network for image captioning. Appl Intell 52(13):14711–14723
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7008–7024
Rostamzadeh N, Hosseini S, Boquet T, Stokowiec W, Zhang Y, Jauvin C, Pal C (2018) Fashion-gen: the generative fashion dataset and challenge. arXiv:1806.08317
Shajini M, Ramanan A (2022) A knowledge-sharing semi-supervised approach for fashion clothes classification and attribute prediction. Vis Comput 38(11):3551–3561
Sharma D, Dhiman C, Kumar D (2023) Evolution of visual data captioning methods, datasets, and evaluation metrics: a comprehensive survey. Expert Syst Appl 119773
Stefanini M, Cornia M, Baraldi L, Cascianelli S, Fiameni G, Cucchiara R (2022) From show to tell: a survey on deep learning-based image captioning. IEEE Trans Pattern Anal Mach Intell 45(1):539–559
Vedantam R, Zitnick CL, Parikh D (2015) Cider: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4566–4575
Wang C, Gu X (2022) Dynamic-balanced double-attention fusion for image captioning. Eng Appl Artif Intell 114:105194
Wang C, Gu X (2022) Image captioning with adaptive incremental global context attention. Appl Intell 1–23
Wang C, Shen Y, Ji L (2022) Geometry attention transformer with position-aware lstms for image captioning. Expert Syst Appl 201:117174
Wu D, Li Z, Zhou J, Gan J, Gao W, Li H (2022) Clothing attribute recognition via a holistic relation network. Int J Intell Syst 37(9):6201–6220
Wu H, Gao Y, Guo X, Al-Halah Z, Rennie S, Grauman K, Feris R (2021) Fashion iq: A new dataset towards retrieving images by natural language feedback. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11307–11317
Xian T, Li Z, Zhang C, Ma H (2022) Dual global enhanced transformer for image captioning. Neural Netw 148:129–141
Xu P, Zhu X, Clifton DA (2023) Multimodal learning with transformers: a survey. IEEE Trans Pattern Anal Mach Intell
Yang X, Zhang H, Jin D, Liu Y, Wu C-H, Tan J, Xie D, Wang J, Wang X (2020) Fashion captioning: towards generating accurate descriptions with semantic rewards. In: European conference on computer vision, Springer, pp 1–17
Yuan Z, Mou L, Wang Q, Zhu XX (2022) From easy to hard: Learning language-guided curriculum for visual question answering on remote sensing data. IEEE Trans Geosci Remote Sens 60:1–11
Yue X, Zhang C, Fujita H, Lv Y (2021) Clothing fashion style recognition with design issue graph. Appl Intell 51:3548–3560
Zeng F, Zhao M, Zhang Z, Gao S, Cheng L (2022) Joint clothes detection and attribution prediction via anchor-free framework with decoupled representation transformer. In: Proceedings of the 31st ACM international conference on information & knowledge management, pp 2444–2454
Zhang J, Fang Z, Sun H, Wang Z (2022) Adaptive semantic-enhanced transformer for image captioning. IEEE Trans Neural Netw Learn Syst
Zhang J, Fang Z, Wang Z (2022) Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning. Appl Intell 1–17
Zhang X, Sun X, Luo Y, Ji J, Zhou Y, Wu Y, Huang F, Ji R (2021) Rstnet: captioning with adaptive attention on visual and non-visual words. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 15465–15474
Zhang Z, Shi Y, Yuan C, Li B, Wang P, Hu W, Zha Z-J (2020) Object relational graph with teacher-recommended learning for video captioning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13278–13288
Zhou Y, Zhang Y, Hu Z, Wang M (2021) Semi-autoregressive transformer for image captioning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3139–3143
Zhou Z, Su Z, Wang R (2022) Attribute-aware heterogeneous graph network for fashion compatibility prediction. Neurocomputing 495:62–74
Zhuge M, Gao D, Fan D-P, Jin L, Chen B, Zhou H, Qiu M, Shao L (2021) Kaleido-bert: vision-language pre-training on fashion domain. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12647–12657
Zohourianshahzadi Z, Kalita JK (2022) Neural attention for image captioning: review of outstanding methods. Artif Intell Rev 55(5):3833–3862
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under grant 62172212 and in part by the Natural Science Foundation of Jiangsu Province under grant BK20230031.
Ethics declarations
Conflicts of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tang, Y., Zhang, L., Yuan, Y. et al. Improving fashion captioning via attribute-based alignment and multi-level language model. Appl Intell 53, 30803–30821 (2023). https://doi.org/10.1007/s10489-023-05167-2