Unsupervised graph reasoning distillation hashing for multimodal hamming space search with vision-language model

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

Multimodal hashing maps high-dimensional multimodal data into hash codes, which greatly reduces storage cost and speeds up queries through Hamming similarity computation. However, existing unsupervised methods still face two key obstacles: (1) As large multimodal models evolve, how can the multimodal matching relationships learned by such models be distilled efficiently to train a powerful student model? (2) Existing methods do not consider other adjacencies between multimodal instances, which limits their similarity representation. To address these obstacles, a method called Unsupervised Graph Reasoning Distillation Hashing (UGRDH) is proposed. UGRDH uses CLIP as the teacher model to extract fine-grained multimodal features and relations for teacher–student distillation. Specifically, the teacher's multimodal features are used to construct a similarity–complementary relation graph matrix, and the proposed graph convolution auxiliary network performs feature aggregation guided by this matrix to generate more discriminative hash codes. In addition, a cross-attention module is designed to reason about potential instance relations and enable effective teacher–student distillation. As a result, UGRDH greatly improves search precision while remaining lightweight. Experimental results show that our method achieves performance improvements of about 1.5%, 3%, and 2.8% on MS COCO, NUS-WIDE, and MIRFlickr, respectively.
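
To make the Hamming-space search mentioned in the abstract concrete, the following minimal Python sketch shows how real-valued features could be turned into hash codes and ranked by Hamming distance. It is not the UGRDH implementation: the sign-threshold binarization, the function names (binarize, hamming_distance, rank_by_hamming), and the toy 4-dimensional features are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Pack the signs of a real-valued vector into an integer hash code.

    Bit i is set when features[i] > 0. Sign thresholding is an assumption
    made for illustration, not a detail taken from the paper.
    """
    code = 0
    for i, value in enumerate(features):
        if value > 0:
            code |= 1 << i
    return code


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hash codes differ."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: Sequence[int]) -> List[int]:
    """Return database indices ordered from nearest to farthest code."""
    return sorted(
        range(len(database)),
        key=lambda i: hamming_distance(query, database[i]),
    )


# Toy example: a hypothetical text query against three hypothetical image features.
query_code = binarize([0.9, -0.2, 0.4, -0.7])
image_codes = [
    binarize([0.8, -0.1, 0.3, -0.5]),   # same sign pattern as the query
    binarize([-0.6, 0.7, 0.2, -0.9]),   # differs from the query in two bits
    binarize([-0.3, 0.5, -0.8, 0.1]),   # differs from the query in all four bits
]
ranking = rank_by_hamming(query_code, image_codes)
```

Because the comparison reduces to an XOR followed by a bit count, a query over stored codes avoids floating-point similarity computation entirely, which is the source of the storage and speed benefits the abstract describes.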

Data Availability

The data that supports the findings of this study will be made available on request.

Acknowledgements

This work was supported by the Open Fund of Advanced Cryptography and System Security Key Laboratory of Sichuan Province (Grant No. SKLACSS–202208), the National Natural Science Foundation of China (No. 61772295), the Postgraduate Scientific Research Innovation Project of Chongqing Normal University (YKC23025), and the Science and Technology Research Program of Chongqing Municipal Education Commission (Grant No. KJZD-M202000501).

Author information

Contributions

LS: writing (original draft preparation). YD: writing (review and editing), supervision, funding acquisition. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Yumin Dong.

Ethics declarations

Conflict of interest

The authors declare that the publication of this paper has no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sun, L., Dong, Y. Unsupervised graph reasoning distillation hashing for multimodal hamming space search with vision-language model. Int J Multimed Info Retr 13, 16 (2024). https://doi.org/10.1007/s13735-024-00326-8

  • DOI: https://doi.org/10.1007/s13735-024-00326-8
