Abstract
Recently, cross-modal hashing has become a promising line of research in cross-modal retrieval. It not only exploits multiple complementary heterogeneous data modalities for improved retrieval accuracy, but also enjoys a reduced memory footprint and fast query speed thanks to efficient binary feature embedding. With the boom of deep learning, the convolutional neural network (CNN) has become the de facto backbone for advanced cross-modal hashing algorithms. Recent research demonstrates that the dominant role of CNNs is being challenged by increasingly effective Transformer architectures, whose advantage in long-range modeling comes from relaxing the local inductive bias. However, the absence of this inductive bias disrupts the inherent geometric structure, which inevitably compromises neighborhood correlation. To alleviate this problem, in this paper we propose a novel cross-modal hashing method termed Multi-head Hashing with Orthogonal Decomposition (MHOD) for cross-modal retrieval. More specifically, with multi-modal Transformers as backbones, MHOD leverages orthogonal decomposition to decouple local cues from global features, and further captures their intrinsic correlations through the proposed multi-head hash layer. In this way, the global and local representations are simultaneously embedded into the resulting binary code, leading to a comprehensive and robust representation. Extensive experiments on popular cross-modal retrieval benchmark datasets demonstrate that the proposed MHOD method achieves advantageous performance against other state-of-the-art cross-modal hashing approaches.
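To make the described pipeline more concrete, the following PyTorch-style sketch illustrates one plausible reading of the two components named in the abstract: an orthogonal decomposition that separates local token features into components parallel and orthogonal to the global feature, followed by a multi-head hash layer that maps the fused representation to a relaxed binary code. The module name, head count, mean-pooling of the orthogonal components, and concatenation-based fusion are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalDecompositionHash(nn.Module):
    """Illustrative sketch (not the authors' released code): decompose local
    token features against the global feature, then produce a hash code
    through several small hash heads. All dimensions are assumptions."""

    def __init__(self, dim=768, num_heads=4, code_len=64):
        super().__init__()
        assert code_len % num_heads == 0
        # One small hash head per group of code bits (assumed design choice).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * dim, code_len // num_heads) for _ in range(num_heads)]
        )

    def forward(self, global_feat, local_feats):
        # global_feat: (B, D), e.g. the [CLS] token of a Transformer backbone
        # local_feats: (B, N, D), patch/word token features
        g = F.normalize(global_feat, dim=-1)                         # (B, D)
        # Projection of each local token onto the global direction.
        proj = (local_feats * g.unsqueeze(1)).sum(-1, keepdim=True) * g.unsqueeze(1)
        # Orthogonal component carries local cues complementary to the global feature.
        ortho = local_feats - proj                                   # (B, N, D)
        local_summary = ortho.mean(dim=1)                            # (B, D), pooled local cue
        fused = torch.cat([global_feat, local_summary], dim=-1)      # (B, 2D)
        # Each head emits a slice of the code; tanh is the usual relaxation of sign().
        return torch.cat([torch.tanh(h(fused)) for h in self.heads], dim=-1)
```

At retrieval time such a relaxed code would be binarized with sign() and compared by Hamming distance, as is standard in hashing-based retrieval; the sketch only shows the forward pass of a single modality branch.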
Supported by the National Natural Science Foundation of China under Grants 62173186, 62076134, and 62303230, and by the Jiangsu Provincial Colleges Natural Science General Program under Grant 22KJB510004.