Abstract
Due to practical demands and substantial potential benefits, there is growing interest in fashion image retrieval with attribute manipulation. For example, if a user wants a product that is similar to a query image but has the attribute “3/4 sleeves” instead of “short sleeves”, they can modify the query image by entering text. Unlike general items, fashion items are rich in categories and attributes, and items with different attributes may differ only subtly in appearance. Moreover, the visual appearance of fashion items changes dramatically under different conditions, such as lighting, viewing angle, and occlusion. These factors make fashion retrieval challenging. We therefore learn an attribute-specific space for each attribute to obtain discriminative features. In this paper, we propose a dual attention composition network for image retrieval with attribute manipulation, which addresses two key questions: where to focus and how to modify. The dual attention module captures fine-grained image-text alignment through spatial and channel attention and enables multi-modal composition through the corresponding affine transformations. The TIRG-based semantic composition module combines the attention features of the query image with the embedding of the manipulation text to obtain a synthetic representation that is close to the target image. In addition, we investigate the semantic hierarchy of attributes and propose a hierarchical encoding method that preserves the associations between attributes for efficient feature learning. Extensive experiments on three multi-modal fashion retrieval datasets demonstrate the superiority of our network.
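To make the two components named in the abstract concrete, below is a minimal PyTorch sketch of a text-conditioned dual (channel and spatial) attention block and a TIRG-style gated residual composition in the spirit of Vo et al. (CVPR 2019). All module names, layer sizes, and wiring here are illustrative assumptions rather than the paper's actual implementation; the CNN backbone and text encoder are assumed to exist upstream.

```python
# Illustrative sketch only: the real network's architecture and hyperparameters
# are described in the paper, not here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualAttention(nn.Module):
    """Text-conditioned channel and spatial attention over an image feature map."""

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        # Channel attention: squeeze-and-excitation style gate conditioned on the text.
        self.channel_gate = nn.Sequential(
            nn.Linear(channels + text_dim, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )
        # Spatial attention: 1x1 convolution producing a per-location weight.
        self.spatial_gate = nn.Conv2d(channels + text_dim, 1, kernel_size=1)

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = img_feat.shape
        pooled = F.adaptive_avg_pool2d(img_feat, 1).flatten(1)            # (B, C)
        ch_w = self.channel_gate(torch.cat([pooled, text_feat], dim=1))   # (B, C)
        feat = img_feat * ch_w.view(b, c, 1, 1)                           # channel-weighted map
        text_map = text_feat.view(b, -1, 1, 1).expand(-1, -1, h, w)
        sp_w = torch.sigmoid(self.spatial_gate(torch.cat([feat, text_map], dim=1)))
        return feat * sp_w                                                # (B, C, H, W)


class TIRGComposition(nn.Module):
    """TIRG-style gated residual fusion of image and text features."""

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        hidden = feat_dim + text_dim
        self.gate = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())
        self.residual = nn.Sequential(
            nn.Linear(hidden, feat_dim), nn.ReLU(inplace=True), nn.Linear(feat_dim, feat_dim)
        )
        # Learnable weights balancing the gated and residual branches.
        self.w = nn.Parameter(torch.tensor([1.0, 1.0]))

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([img_feat, text_feat], dim=1)
        return self.w[0] * self.gate(x) * img_feat + self.w[1] * self.residual(x)


# Example wiring (shapes are hypothetical):
#   attn = DualAttention(channels=1024, text_dim=512)
#   comp = TIRGComposition(feat_dim=1024, text_dim=512)
#   attended = attn(img_map, text_vec)                    # (B, 1024, H, W)
#   composed = comp(attended.mean(dim=(2, 3)), text_vec)  # (B, 1024)
# `composed` is then compared against target-image features, e.g. with a
# triplet loss, to rank candidate images.
```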





Data availability
The data that support the findings of this study are derived from the following public domain resources.
1. FashionIQ can be downloaded from https://github.com/hongwang600/fashion-iq-metadata.
2. Fashion200k can be downloaded from https://github.com/xthan/fashion-200k.
3. Shoes can be downloaded from https://github.com/XiaoxiaoGuo/fashion-retrieval/tree/master/dataset.
Ethics declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
About this article
Cite this article
Wan, Y., Zou, G., Yan, C. et al. Dual attention composition network for fashion image retrieval with attribute manipulation. Neural Comput & Applic 35, 5889–5902 (2023). https://doi.org/10.1007/s00521-022-07994-9