
SHAF: Semantic-Guided Hierarchical Alignment and Fusion for Composed Image Retrieval

  • Conference paper
Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14879)


Abstract

Composed image retrieval aims to enhance user searches by accurately capturing search intent, which requires semantically aligning and fusing image and text features. We propose the Semantic-guided Hierarchical Alignment and Fusion network (SHAF), designed to combine information from the visual and textual modalities across multiple network layers. SHAF employs attention mechanisms to progressively align text and image features from low to high levels, bridging the semantic gap between the modalities. Through dynamic weight allocation and feature enhancement mechanisms, the network integrates complementary information from the image and the text fragments of the query, producing a composite feature in a unified embedding space. Extensive experiments on the FashionIQ and Shoes datasets (gains of +7.15 and +7.58 in R@10, respectively) show that SHAF outperforms state-of-the-art models on composed image retrieval tasks. The code is publicly available on GitHub: https://github.com/Maserhe/SHAF.
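To make the described pipeline concrete, below is a minimal PyTorch sketch of the hierarchical alignment-and-fusion idea from the abstract: image features from several levels are aligned to the text via cross-attention, the per-level features are fused with dynamically learned weights, and the result is a single composite embedding used for retrieval. All module names, feature dimensions, the number of levels, and the softmax gating are illustrative assumptions, not the authors' implementation; see the linked GitHub repository for the official code.

```python
# Hedged sketch of hierarchical alignment and fusion; details are assumptions,
# NOT the SHAF implementation (see https://github.com/Maserhe/SHAF).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignBlock(nn.Module):
    """Aligns one level of image tokens to the text query via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # Image tokens attend to text tokens, injecting the modification intent.
        attended, _ = self.attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        return self.norm(img_tokens + attended)


class HierarchicalAlignFuse(nn.Module):
    """Aligns image features level by level, then fuses them with dynamic weights."""

    def __init__(self, dim: int = 512, num_levels: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [CrossModalAlignBlock(dim) for _ in range(num_levels)]
        )
        self.gate = nn.Linear(dim, 1)   # one weight logit per level
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_levels, txt_tokens):
        pooled = []
        for block, img_tokens in zip(self.blocks, img_levels):
            aligned = block(img_tokens, txt_tokens)   # (B, N_level, D)
            pooled.append(aligned.mean(dim=1))        # (B, D) summary per level
        stacked = torch.stack(pooled, dim=1)          # (B, L, D)
        weights = torch.softmax(self.gate(stacked), dim=1)  # dynamic level weights
        fused = (weights * stacked).sum(dim=1)        # (B, D) composite feature
        return F.normalize(self.proj(fused), dim=-1)  # unit norm for cosine retrieval


if __name__ == "__main__":
    torch.manual_seed(0)
    B, D = 2, 512
    # Three levels of image tokens (e.g., low-, mid-, and high-level feature maps).
    img_levels = [torch.randn(B, n, D) for n in (49, 25, 9)]
    txt_tokens = torch.randn(B, 16, D)  # encoded modification text
    model = HierarchicalAlignFuse(dim=D, num_levels=3)
    composite = model(img_levels, txt_tokens)
    # Rank a gallery of candidate images by cosine similarity to the composite query.
    gallery = F.normalize(torch.randn(100, D), dim=-1)
    scores = composite @ gallery.T
    print(scores.topk(10, dim=-1).indices)  # top-10 candidates per query (cf. R@10)
```

The softmax gate here is only a simple stand-in for the paper's dynamic weight allocation; the normalized composite feature is then compared against target-image embeddings in the shared space, as in the retrieval step above.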

Author information

Corresponding author

Correspondence to Erhe Yang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yan, C., Yang, E., Tao, R., Wan, Y., Ai, D. (2024). SHAF: Semantic-Guided Hierarchical Alignment and Fusion for Composed Image Retrieval. In: Huang, D.S., Zhang, X., Zhang, C. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14879. Springer, Singapore. https://doi.org/10.1007/978-981-97-5675-9_38

  • DOI: https://doi.org/10.1007/978-981-97-5675-9_38

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5674-2

  • Online ISBN: 978-981-97-5675-9

  • eBook Packages: Computer Science, Computer Science (R0)
