Abstract
Cross-modal retrieval retrieves related items in one modality given a query from another. Image-text retrieval, its foundational and central task, has attracted significant research interest. In recent years, hashing techniques have been widely adopted for large-scale retrieval because of their low storage cost and fast query processing. However, existing hashing approaches learn either a unified representation shared by both modalities or a specific representation within each modality. The former lacks modality-specific information, while the latter ignores the relationships between paired images and texts across modalities. We therefore propose a supervised hashing method that combines intra-modality and inter-modality matrix factorization. The method integrates semantic labels into hash code learning and models both inter-modality and intra-modality relationships within a unified framework, with the objective of preserving inter-modal complementarity and intra-modal consistency in multimodal data. Our approach (1) maps data from the different modalities into a shared latent semantic space through inter-modality matrix factorization to derive unified hash codes, and (2) maps data from each modality into a modality-specific latent semantic space via intra-modality matrix factorization to obtain modality-specific hash codes. The two sets of codes are then merged to construct the final hash codes. Experimental results show that our approach outperforms several state-of-the-art cross-modal hashing methods for image-text retrieval, and ablation studies validate the effectiveness of each component of the model.
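The two factorization steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' algorithm: the abstract does not state the objective function, the optimizer, how semantic labels enter, or how the codes are merged. The sketch assumes a plain regularized least-squares factorization solved by alternating updates, omits label supervision entirely, and merges codes by simple concatenation. All names (`factorize`, `learn_codes`, `k_shared`, `k_spec`, `lam`) are hypothetical.

```python
import numpy as np

def factorize(X_list, k, lam=1e-2, iters=30, seed=0):
    """Approximate each X_m (d_m x n) as U_m @ V with one shared V (k x n).

    Assumed objective: sum_m ||X_m - U_m V||^2 + lam * (||U_m||^2 + ||V||^2),
    minimized by alternating closed-form ridge updates.
    """
    rng = np.random.default_rng(seed)
    n = X_list[0].shape[1]
    V = rng.standard_normal((k, n))
    I = np.eye(k)
    for _ in range(iters):
        # Update every basis U_m with V fixed.
        U_list = [X @ V.T @ np.linalg.inv(V @ V.T + lam * I) for X in X_list]
        # Update the latent representation V with all U_m fixed.
        A = sum(U.T @ U for U in U_list) + lam * I
        B = sum(U.T @ X for U, X in zip(U_list, X_list))
        V = np.linalg.solve(A, B)
    return U_list, V

def binarize(V):
    # Map real-valued latent factors to codes in {-1, +1}.
    return np.where(V >= 0, 1, -1)

def learn_codes(X_img, X_txt, k_shared, k_spec):
    # (1) Inter-modality: both modalities share one latent space.
    _, V_shared = factorize([X_img, X_txt], k_shared)
    # (2) Intra-modality: each modality gets its own latent space.
    _, V_img = factorize([X_img], k_spec)
    _, V_txt = factorize([X_txt], k_spec)
    # Merge step (assumption: concatenation along the code dimension).
    return binarize(np.vstack([V_shared, V_img, V_txt]))

# Synthetic paired data: 40 image-text pairs with 16-d and 10-d features.
rng = np.random.default_rng(1)
Z = rng.standard_normal((6, 40))
X_img = rng.standard_normal((16, 6)) @ Z
X_txt = rng.standard_normal((10, 6)) @ Z
codes = learn_codes(X_img, X_txt, k_shared=8, k_spec=4)
print(codes.shape)
```

Each column of `codes` is the merged code of one training pair. Encoding an unseen query from a single modality would additionally require learned hash functions, which this sketch leaves out.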
Data availability and access
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgements
This work was supported by the Humanities and Social Sciences Project of the Education Ministry (20YJA870013) and the Scientific Research Studio in Colleges and Universities of Ji’nan City (202228105). The authors would like to thank the reviewers for their helpful comments.
Author information
Contributions
DongXue Shi: Writing - original draft, Conceptualization, Methodology, Software, Investigation, Validation. Zheng Liu: Conceptualization, Writing - review & editing, Supervision, Project administration, Funding acquisition. Shanshan Gao: Writing - review & editing, Funding acquisition. Ang Li: Software, Writing - review & editing, Supervision.
Ethics declarations
Competing interest
The authors have no competing interests to declare that are relevant to the content of this article.
Ethical and informed consent for data used
The work described in this manuscript is original and is not under consideration for publication elsewhere. All authors read and approved the final manuscript. The research in this manuscript does not involve human participants or animals. We obtain ethical and informed consent from data subjects before collecting, using, or disclosing their personal data.
About this article
Cite this article
Shi, D., Liu, Z., Gao, S. et al. Semantic-aware matrix factorization hashing with intra- and inter-modality fusion for image-text retrieval. Appl Intell 55, 5 (2025). https://doi.org/10.1007/s10489-024-06060-2
DOI: https://doi.org/10.1007/s10489-024-06060-2