Abstract
Generating accurate descriptions for online fashion items is important not only for enhancing customers’ shopping experiences, but also for increasing online sales. Beyond correctly presenting an item’s attributes, descriptions written in an enchanting style can better attract customer interest. The goal of this work is to develop a novel learning framework for accurate and expressive fashion captioning. Unlike the objects in generic image captioning, the rich attributes of fashion items are hard to identify and describe. We seed the description of an item by first identifying its attributes, and introduce an attribute-level semantic (ALS) reward and a sentence-level semantic (SLS) reward as metrics to improve the quality of the generated descriptions. We further train our model by integrating maximum likelihood estimation (MLE), attribute embedding, and reinforcement learning (RL). To facilitate learning, we build a new FAshion CAptioning Dataset (FACAD), which contains 993K images and 130K corresponding enchanting and diverse descriptions. Experiments on FACAD demonstrate the effectiveness of our model (code and data: https://github.com/xuewyang/Fashion_Captioning).
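The training recipe sketched in the abstract (MLE warm-up plus policy-gradient fine-tuning with semantic rewards) can be made concrete with a short sketch. The snippet below is an illustrative assumption, not the paper's actual implementation: the toy `als_reward` and `sls_reward` functions, the equal 0.5/0.5 reward weights, and the self-critical greedy baseline (Rennie et al., CVPR 2017) are stand-ins for the formulations defined in the full paper.

```python
# Illustrative sketch only: MLE warm-up loss plus one self-critical
# RL step with a combined attribute/sentence-level reward. The toy
# reward functions and weights are assumptions for exposition, not
# the paper's actual formulation.
import torch
import torch.nn.functional as F


def mle_loss(logits, targets, pad_id=0):
    """Cross-entropy over caption tokens for the MLE warm-up phase.
    logits: (batch, seq_len, vocab); targets: (batch, seq_len)."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
    )


def als_reward(caption, attributes):
    """Toy attribute-level term: fraction of annotated attributes
    that the generated caption actually mentions."""
    if not attributes:
        return 0.0
    return sum(a in caption for a in attributes) / len(attributes)


def sls_reward(caption, reference):
    """Toy sentence-level term: unigram precision against the
    reference description (a crude stand-in for a sentence score)."""
    if not caption:
        return 0.0
    ref = set(reference)
    return sum(w in ref for w in caption) / len(caption)


def combined_reward(caption, reference, attributes, w_als=0.5, w_sls=0.5):
    # The equal weights are an assumption, not taken from the paper.
    return (w_als * als_reward(caption, attributes)
            + w_sls * sls_reward(caption, reference))


def self_critical_loss(sample_log_probs, sample_rewards, greedy_rewards):
    """REINFORCE with a greedy-decode baseline (self-critical sequence
    training, Rennie et al., CVPR 2017): advantage = r_sample - r_greedy."""
    advantage = sample_rewards - greedy_rewards  # shape: (batch,)
    return -(advantage * sample_log_probs).mean()


# Tiny worked example of one RL step on a single fake caption.
cap_sample = ["a", "floral", "maxi", "dress"]           # sampled caption
cap_greedy = ["a", "dress"]                             # greedy baseline
reference  = ["a", "flowy", "floral", "maxi", "dress"]  # ground truth
attributes = ["floral", "maxi"]                         # annotated attributes

r_s = torch.tensor([combined_reward(cap_sample, reference, attributes)])
r_g = torch.tensor([combined_reward(cap_greedy, reference, attributes)])
logp = torch.tensor([-3.2], requires_grad=True)  # sum log-prob of sampled tokens
loss = self_critical_loss(logp, r_s, r_g)
loss.backward()  # gradient pushes probability toward high-reward samples
```

In practice the sampled caption and its log-probability would come from the captioning decoder; the toy example above only exercises the reward and loss arithmetic.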
Acknowledgements
This work is supported in part by the National Science Foundation under Grants NSF ECCS 1731238 and NSF CCF 2007313.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Yang, X., et al. (2020). Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12358. Springer, Cham. https://doi.org/10.1007/978-3-030-58601-0_1
DOI: https://doi.org/10.1007/978-3-030-58601-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58600-3
Online ISBN: 978-3-030-58601-0
eBook Packages: Computer Science, Computer Science (R0)