Multi-head Hashing with Orthogonal Decomposition for Cross-modal Retrieval

  • Conference paper
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14555)
  • Included in the conference series: MultiMedia Modeling (MMM 2024)

Abstract

Recently, cross-modal hashing has become a promising line of research in cross-modal retrieval. It not only exploits complementary heterogeneous data modalities for improved retrieval accuracy, but also enjoys a reduced memory footprint and fast query speed thanks to efficient binary feature embedding. With the boom of deep learning, convolutional neural networks (CNNs) have become the de facto choice for advanced cross-modal hashing algorithms. Recent research demonstrates that the dominant role of CNNs is being challenged by increasingly effective Transformer architectures, which relax the local inductive bias in favor of long-range modeling. However, the absence of this inductive bias disrupts the inherent geometric structure, inevitably compromising neighborhood correlation. To alleviate this problem, we propose in this paper a novel cross-modal hashing method termed Multi-head Hashing with Orthogonal Decomposition (MHOD) for cross-modal retrieval. More specifically, with multi-modal Transformers as the backbones, MHOD leverages orthogonal decomposition to decouple local cues from global features, and further captures their intrinsic correlations through our designed multi-head hash layer. In this way, the global and local representations are simultaneously embedded into the resulting binary code, yielding a comprehensive and robust representation. Extensive experiments on popular cross-modal retrieval benchmark datasets demonstrate that the proposed MHOD achieves advantageous performance against other state-of-the-art cross-modal hashing approaches.

Supported by the National Natural Science Foundation of China under Grants 62173186, 62076134, and 62303230, and by the Jiangsu provincial colleges Natural Science General Program under Grant 22KJB510004.
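The abstract describes the mechanism only at a high level. Below is a minimal sketch of the two core ideas as they read from the abstract: orthogonal decomposition splits each local token into components parallel and orthogonal to the global feature, and several independent hash heads each emit a sub-code that is concatenated into the full relaxed binary code. All module and variable names here are hypothetical, and the training losses (similarity preservation, quantization) are omitted, so this illustrates the general technique rather than the authors' exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OrthogonalMultiHeadHash(nn.Module):
        # Hypothetical sketch: decouple local cues from the global feature via
        # orthogonal decomposition, then hash the fused representation with
        # several independent heads whose outputs form one binary code.
        def __init__(self, dim: int, code_bits: int, num_heads: int = 4):
            super().__init__()
            assert code_bits % num_heads == 0
            self.heads = nn.ModuleList(
                nn.Linear(2 * dim, code_bits // num_heads) for _ in range(num_heads)
            )

        def forward(self, local_tokens, global_feat):
            # local_tokens: (B, N, D) patch/word tokens from a Transformer backbone
            # global_feat:  (B, D) [CLS]-style global feature
            g = F.normalize(global_feat, dim=-1)              # unit global direction
            coef = (local_tokens * g.unsqueeze(1)).sum(-1, keepdim=True)
            ortho = local_tokens - coef * g.unsqueeze(1)      # strip the global component
            local_summary = ortho.mean(dim=1)                 # pool the decoupled local cues
            fused = torch.cat([global_feat, local_summary], dim=-1)
            # Each head hashes the fused feature into one code segment in (-1, 1).
            return torch.cat([torch.tanh(h(fused)) for h in self.heads], dim=-1)

At indexing time the relaxed code would be binarized with torch.sign; a full training loop would additionally impose cross-modal similarity-preserving and quantization losses on the codes of both modalities, which this sketch leaves out.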



Author information

Corresponding author: Jun Li.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Liu, W., Li, J., Wu, Z., Xu, J., Yang, B. (2024). Multi-head Hashing with Orthogonal Decomposition for Cross-modal Retrieval. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_13

  • DOI: https://doi.org/10.1007/978-3-031-53308-2_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53307-5

  • Online ISBN: 978-3-031-53308-2

  • eBook Packages: Computer Science, Computer Science (R0)
