FreestyleRet: Retrieving Images from Style-Diversified Queries

Conference paper, Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Image retrieval aims to find the images corresponding to a given query. In practice, users wish to express their retrieval intent through a variety of query styles. However, current retrieval tasks focus predominantly on text queries, which limits the available query options and can introduce ambiguity or bias in user intent. In this paper, we propose the Style-Diversified Query-Based Image Retrieval task, which enables retrieval from queries of various styles. To support this new setting, we introduce the first Diverse-Style Retrieval dataset, covering diverse query styles including text, sketch, low-resolution image, and art. We also propose a lightweight style-diversified retrieval framework. For each query style, we apply the Gram matrix to extract the query's textural features and cluster them into a style space with style-specific bases. We then employ a style-init prompt learning module so that the visual encoder can comprehend the texture and style information of the query. Experiments demonstrate that our model outperforms existing retrieval models on the style-diversified retrieval task. Moreover, style-diversified queries (sketch+text, art+text, etc.) can be retrieved simultaneously by our model, and the auxiliary information from the other query styles improves performance on each individual query type, which may be of significance to the community.
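The Gram-matrix texture extraction mentioned in the abstract can be sketched as follows. This is a minimal illustration of the general technique, not the paper's implementation: the function name, the (channels, height, width) input convention, and the normalization by the number of spatial positions are all assumptions.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Compute the Gram matrix of a feature map.

    features: array of shape (C, H, W), i.e. C channels over an H x W grid.
    Returns a (C, C) matrix of channel-wise inner products, normalized by
    the number of spatial positions. These channel correlations summarize
    texture/style statistics independently of spatial layout.
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)   # flatten the spatial dimensions
    return (f @ f.T) / (h * w)       # channel-by-channel correlations

# Toy example: a random "feature map" with 4 channels on an 8x8 grid
feats = np.random.default_rng(0).normal(size=(4, 8, 8))
g = gram_matrix(feats)
print(g.shape)  # (4, 4); symmetric by construction
```

Because the spatial dimensions are averaged out, two images with similar textures but different layouts yield similar Gram matrices, which is why such statistics are a common proxy for style.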

H. Li and Y. Jia—Equal Contribution.

Code and dataset are available here.



Acknowledgements

This work was supported by the Natural Science Foundation of China (No. 62202014) and the Shenzhen Basic Research Program (No. JCYJ20220813151736001).

Author information


Corresponding author

Correspondence to Li Yuan.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 4054 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, H. et al. (2025). FreestyleRet: Retrieving Images from Style-Diversified Queries. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15081. Springer, Cham. https://doi.org/10.1007/978-3-031-73337-6_15


  • DOI: https://doi.org/10.1007/978-3-031-73337-6_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73336-9

  • Online ISBN: 978-3-031-73337-6

  • eBook Packages: Computer Science, Computer Science (R0)
