Abstract
Composed image retrieval enhances user search by accurately capturing intent, which requires semantically aligning and fusing image and text features. We propose the Semantic-guided Hierarchical Alignment and Fusion network (SHAF), designed to combine information from the visual and textual modalities across multiple network layers. SHAF employs attention mechanisms to progressively align text and image features from low to high levels, effectively bridging the semantic gap between the modalities. Through dynamic weight allocation and feature enhancement, the network integrates complementary information from the image and the text fragments of the query, producing a composite feature in a unified embedding space. Extensive experiments on the FashionIQ and Shoes datasets (+7.15 and +7.58 in R@10, respectively) show that SHAF outperforms state-of-the-art models on composed image retrieval. The code is publicly available on GitHub: https://github.com/Maserhe/SHAF.
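The mechanism the abstract describes can be illustrated with a minimal sketch: cross-attention blocks align image features to text features at several levels, and a learned gate dynamically weights the two modalities before fusion into a single composite query feature. All module names, dimensions, and design details below are illustrative assumptions, not the authors' implementation, which is available at the GitHub URL above.

```python
# Hypothetical sketch of hierarchical alignment + gated fusion,
# assuming CLIP-style patch/token features as inputs.
import torch
import torch.nn as nn

class CrossModalAlignBlock(nn.Module):
    """Align image tokens with text tokens via multi-head cross-attention."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens query the text tokens; the residual connection
        # preserves the original visual signal.
        aligned, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(img_tokens + aligned)

class GatedFusion(nn.Module):
    """Fuse pooled image and text features with a dynamic, learned weight."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feat, txt_feat):
        w = self.gate(torch.cat([img_feat, txt_feat], dim=-1))
        return w * img_feat + (1 - w) * txt_feat  # composite query feature

# Toy usage: three "levels" of alignment followed by gated fusion.
dim, batch, n_img, n_txt = 512, 4, 49, 16
blocks = nn.ModuleList(CrossModalAlignBlock(dim) for _ in range(3))
fuse = GatedFusion(dim)
img = torch.randn(batch, n_img, dim)  # e.g. patch features of the reference image
txt = torch.randn(batch, n_txt, dim)  # e.g. token features of the modifying text
for blk in blocks:
    img = blk(img, txt)               # low-to-high hierarchical alignment
query = fuse(img.mean(dim=1), txt.mean(dim=1))  # (batch, dim), unified space
```

In this sketch the sigmoid gate plays the role of dynamic weight allocation: it decides, per dimension, how much of the composite feature comes from the aligned image versus the text.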
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yan, C., Yang, E., Tao, R., Wan, Y., Ai, D. (2024). SHAF: Semantic-Guided Hierarchical Alignment and Fusion for Composed Image Retrieval. In: Huang, DS., Zhang, X., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14879. Springer, Singapore. https://doi.org/10.1007/978-981-97-5675-9_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5674-2
Online ISBN: 978-981-97-5675-9
eBook Packages: Computer Science, Computer Science (R0)