Abstract
Clothes image retrieval is a fundamental task that has attracted research interest over the past decades. Since a single-shot retrieval rarely achieves the best performance, we develop an interactive image retrieval system for fashion outfit search, in which natural language feedback provided by the user is used to capture compound and more specific details of clothing attributes. Our model consists of two parts: a feature fusion module and a similarity metric learning module. The fusion module combines the feature vectors of the modification text with the feature vectors of the reference image. The model is then optimized end to end via a matching objective, where a contrastive learning strategy is adopted to learn the similarity metric. Extensive experiments show that, compared with more complex multi-modal models proposed in recent years, our approach improves retrieval performance while keeping the architecture simple.
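As a rough illustration of the two components described above, the following PyTorch-style sketch fuses a reference-image embedding with a text-feedback embedding and trains with an in-batch contrastive matching loss. The encoder dimensions (2048 for a ResNet-style image feature, 768 for a BERT-style text feature), the concatenation-plus-MLP fusion, and the temperature-scaled loss are illustrative assumptions for exposition only, not the paper's exact design.

# Minimal sketch (not the authors' code): fuse image and text-feedback features,
# then learn a similarity metric with a contrastive matching objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionRetrievalModel(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        # Fuse the reference-image feature and the feedback-text feature.
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Project candidate (target) image features into the same space.
        self.target_proj = nn.Linear(img_dim, embed_dim)

    def forward(self, ref_img_feat, text_feat, target_img_feat):
        query = self.fusion(torch.cat([ref_img_feat, text_feat], dim=-1))
        target = self.target_proj(target_img_feat)
        return F.normalize(query, dim=-1), F.normalize(target, dim=-1)

def contrastive_matching_loss(query, target, temperature=0.07):
    # In-batch contrastive loss: each fused query should match its own target
    # image and be pushed away from the other targets in the batch.
    logits = query @ target.t() / temperature
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)

# Usage with dummy pre-extracted features (batch size 8):
model = FusionRetrievalModel()
q, t = model(torch.randn(8, 2048), torch.randn(8, 768), torch.randn(8, 2048))
loss = contrastive_matching_loss(q, t)
loss.backward()

The design choice here is deliberately simple: both the fused query and the candidate image live in one shared embedding space, so retrieval reduces to a nearest-neighbour search over normalized vectors.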
Acknowledgement
This work is partially supported by the National Key Research and Development Program of China (2019YFC1521300), the National Natural Science Foundation of China (61971121), the Fundamental Research Funds for the Central Universities of China, and grant T00120210002 of the Shenzhen Research Institute of Big Data.
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, X., Rong, Y., Zhao, M., Fan, J. (2021). Interactive Clothes Image Retrieval via Multi-modal Feature Fusion of Image Representation and Natural Language Feedback. In: Zhang, H., Yang, Z., Zhang, Z., Wu, Z., Hao, T. (eds) Neural Computing for Advanced Applications. NCAA 2021. Communications in Computer and Information Science, vol 1449. Springer, Singapore. https://doi.org/10.1007/978-981-16-5188-5_41
Print ISBN: 978-981-16-5187-8
Online ISBN: 978-981-16-5188-5