Abstract
Composed image retrieval enhances user search by accurately capturing intent, which requires semantically aligning and fusing image and text features. We propose the Semantic-guided Hierarchical Alignment and Fusion network (SHAF), designed to combine information from the visual and textual modalities across multiple network layers. SHAF employs attention mechanisms to progressively align text and image features from low to high levels, effectively bridging the semantic gap between the modalities. Through dynamic weight allocation and feature enhancement, the network integrates complementary information from the image and the text fragments of the query, producing a composite feature in a unified embedding space. Extensive experiments on the FashionIQ and Shoes datasets (+7.15 and +7.58 in R@10, respectively) show that SHAF outperforms state-of-the-art models on composed image retrieval. The code is publicly available on GitHub: https://github.com/Maserhe/SHAF.
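The mechanism the abstract describes can be illustrated with a minimal sketch: cross-attention blocks align image features to text features at several levels, and a learned gate dynamically weights the two modalities before fusion into a single composite query feature. All module names, dimensions, and design details below are illustrative assumptions, not the authors' implementation, which is available at the GitHub URL above.

```python
# Hypothetical sketch of hierarchical alignment + gated fusion,
# assuming CLIP-style patch/token features as inputs.
import torch
import torch.nn as nn

class CrossModalAlignBlock(nn.Module):
    """Align image tokens with text tokens via multi-head cross-attention."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens query the text tokens; the residual connection
        # preserves the original visual signal.
        aligned, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(img_tokens + aligned)

class GatedFusion(nn.Module):
    """Fuse pooled image and text features with a dynamic, learned weight."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feat, txt_feat):
        w = self.gate(torch.cat([img_feat, txt_feat], dim=-1))
        return w * img_feat + (1 - w) * txt_feat  # composite query feature

# Toy usage: three "levels" of alignment followed by gated fusion.
dim, batch, n_img, n_txt = 512, 4, 49, 16
blocks = nn.ModuleList(CrossModalAlignBlock(dim) for _ in range(3))
fuse = GatedFusion(dim)
img = torch.randn(batch, n_img, dim)  # e.g. patch features of the reference image
txt = torch.randn(batch, n_txt, dim)  # e.g. token features of the modifying text
for blk in blocks:
    img = blk(img, txt)               # low-to-high hierarchical alignment
query = fuse(img.mean(dim=1), txt.mean(dim=1))  # (batch, dim), unified space
```

In this sketch the sigmoid gate plays the role of dynamic weight allocation: it decides, per dimension, how much of the composite feature comes from the aligned image versus the text.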
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yan, C., Yang, E., Tao, R., Wan, Y., Ai, D. (2024). SHAF: Semantic-Guided Hierarchical Alignment and Fusion for Composed Image Retrieval. In: Huang, DS., Zhang, X., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14879. Springer, Singapore. https://doi.org/10.1007/978-981-97-5675-9_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5674-2
Online ISBN: 978-981-97-5675-9
eBook Packages: Computer Science, Computer Science (R0)