Abstract
Recently, cross-modal hashing has become a promising line of research in cross-modal retrieval. It not only exploits multiple complementary heterogeneous data modalities for improved retrieval accuracy, but also enjoys a reduced memory footprint and fast query speed thanks to efficient binary feature embedding. With the boom of deep learning, the convolutional neural network (CNN) has become the de facto backbone for advanced cross-modal hashing algorithms. Recent research demonstrates that the dominant role of CNNs is being challenged by increasingly effective Transformer architectures, whose advantage in long-range modeling comes from relaxing the local inductive bias. However, the absence of this inductive bias disrupts the inherent geometric structure, which inevitably compromises neighborhood correlation. To alleviate this problem, in this paper we propose a novel cross-modal hashing method termed Multi-head Hashing with Orthogonal Decomposition (MHOD) for cross-modal retrieval. More specifically, with multi-modal Transformers as backbones, MHOD leverages orthogonal decomposition to decouple local cues from global features, and further captures their intrinsic correlations through the proposed multi-head hash layer. In this way, the global and local representations are simultaneously embedded into the resulting binary code, leading to a comprehensive and robust representation. Extensive experiments on popular cross-modal retrieval benchmark datasets demonstrate that the proposed MHOD method achieves advantageous performance against other state-of-the-art cross-modal hashing approaches.
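To make the described pipeline more concrete, the following PyTorch-style sketch illustrates one plausible reading of the two components named in the abstract: an orthogonal decomposition that separates local token features into components parallel and orthogonal to the global feature, followed by a multi-head hash layer that maps the fused representation to a relaxed binary code. The module name, head count, mean-pooling of the orthogonal components, and concatenation-based fusion are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalDecompositionHash(nn.Module):
    """Illustrative sketch (not the authors' released code): decompose local
    token features against the global feature, then produce a hash code
    through several small hash heads. All dimensions are assumptions."""

    def __init__(self, dim=768, num_heads=4, code_len=64):
        super().__init__()
        assert code_len % num_heads == 0
        # One small hash head per group of code bits (assumed design choice).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * dim, code_len // num_heads) for _ in range(num_heads)]
        )

    def forward(self, global_feat, local_feats):
        # global_feat: (B, D), e.g. the [CLS] token of a Transformer backbone
        # local_feats: (B, N, D), patch/word token features
        g = F.normalize(global_feat, dim=-1)                         # (B, D)
        # Projection of each local token onto the global direction.
        proj = (local_feats * g.unsqueeze(1)).sum(-1, keepdim=True) * g.unsqueeze(1)
        # Orthogonal component carries local cues complementary to the global feature.
        ortho = local_feats - proj                                   # (B, N, D)
        local_summary = ortho.mean(dim=1)                            # (B, D), pooled local cue
        fused = torch.cat([global_feat, local_summary], dim=-1)      # (B, 2D)
        # Each head emits a slice of the code; tanh is the usual relaxation of sign().
        return torch.cat([torch.tanh(h(fused)) for h in self.heads], dim=-1)
```

At retrieval time such a relaxed code would be binarized with sign() and compared by Hamming distance, as is standard in hashing-based retrieval; the sketch only shows the forward pass of a single modality branch.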
Supported by the National Natural Science Foundation of China under Grants 62173186, 62076134, and 62303230, and by the Jiangsu Provincial Colleges Natural Science General Program under Grant 22KJB510004.