Abstract
Due to practical demands and substantial potential benefits, there is growing interest in fashion image retrieval with attribute manipulation. For example, if a user wants a product that is similar to a query image but has the attribute “3/4 sleeves” instead of “short sleeves”, they can modify the query image by entering text. Unlike general items, fashion items are rich in categories and attributes, and items with different attributes may differ only subtly in appearance. Moreover, the visual appearance of fashion items changes dramatically under different conditions, such as lighting, viewing angle, and occlusion. These factors make fashion retrieval challenging. We therefore learn an attribute-specific space for each attribute to obtain discriminative features. In this paper, we propose a dual attention composition network for image retrieval with attribute manipulation, which addresses two key questions: where to focus and how to modify. The dual attention module captures fine-grained image-text alignment through spatial and channel attention and enables multi-modal composition through the corresponding affine transformations. The TIRG-based semantic composition module combines the attention features of the query image with the embedding of the manipulation text to obtain a synthetic representation that is close to the target image. In addition, we investigate the semantic hierarchy of attributes and propose a hierarchical encoding method that preserves the associations between attributes for efficient feature learning. Extensive experiments on three multi-modal fashion retrieval datasets demonstrate the superiority of our network.
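To make the two components named in the abstract concrete, below is a minimal PyTorch sketch of a text-conditioned dual (channel and spatial) attention block and a TIRG-style gated residual composition in the spirit of Vo et al. (CVPR 2019). All module names, layer sizes, and wiring here are illustrative assumptions rather than the paper's actual implementation; the CNN backbone and text encoder are assumed to exist upstream.

```python
# Illustrative sketch only: the real network's architecture and hyperparameters
# are described in the paper, not here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualAttention(nn.Module):
    """Text-conditioned channel and spatial attention over an image feature map."""

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        # Channel attention: squeeze-and-excitation style gate conditioned on the text.
        self.channel_gate = nn.Sequential(
            nn.Linear(channels + text_dim, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )
        # Spatial attention: 1x1 convolution producing a per-location weight.
        self.spatial_gate = nn.Conv2d(channels + text_dim, 1, kernel_size=1)

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = img_feat.shape
        pooled = F.adaptive_avg_pool2d(img_feat, 1).flatten(1)            # (B, C)
        ch_w = self.channel_gate(torch.cat([pooled, text_feat], dim=1))   # (B, C)
        feat = img_feat * ch_w.view(b, c, 1, 1)                           # channel-weighted map
        text_map = text_feat.view(b, -1, 1, 1).expand(-1, -1, h, w)
        sp_w = torch.sigmoid(self.spatial_gate(torch.cat([feat, text_map], dim=1)))
        return feat * sp_w                                                # (B, C, H, W)


class TIRGComposition(nn.Module):
    """TIRG-style gated residual fusion of image and text features."""

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        hidden = feat_dim + text_dim
        self.gate = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())
        self.residual = nn.Sequential(
            nn.Linear(hidden, feat_dim), nn.ReLU(inplace=True), nn.Linear(feat_dim, feat_dim)
        )
        # Learnable weights balancing the gated and residual branches.
        self.w = nn.Parameter(torch.tensor([1.0, 1.0]))

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([img_feat, text_feat], dim=1)
        return self.w[0] * self.gate(x) * img_feat + self.w[1] * self.residual(x)


# Example wiring (shapes are hypothetical):
#   attn = DualAttention(channels=1024, text_dim=512)
#   comp = TIRGComposition(feat_dim=1024, text_dim=512)
#   attended = attn(img_map, text_vec)                    # (B, 1024, H, W)
#   composed = comp(attended.mean(dim=(2, 3)), text_vec)  # (B, 1024)
# `composed` is then compared against target-image features, e.g. with a
# triplet loss, to rank candidate images.
```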





Data availability
The data that support the findings of this study are derived from the following public domain resources.
1. FashionIQ can be downloaded from https://github.com/hongwang600/fashion-iq-metadata.
2. Fashion200k can be downloaded from https://github.com/xthan/fashion-200k.
3. Shoes can be downloaded from https://github.com/XiaoxiaoGuo/fashion-retrieval/tree/master/dataset.
Ethics declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
About this article
Cite this article
Wan, Y., Zou, G., Yan, C. et al. Dual attention composition network for fashion image retrieval with attribute manipulation. Neural Comput & Applic 35, 5889–5902 (2023). https://doi.org/10.1007/s00521-022-07994-9