Abstract
Cross-modal hashing has attracted widespread attention because it reduces the cost of storage and retrieval. However, many existing methods use a sign function to produce hash codes, which loses semantic information when the original features are mapped to a low-dimensional space and consequently decreases retrieval accuracy. To address this problem, we propose a cross-modal hashing method called Multi-Label Semantic Sharing based on Graph Convolutional Network for Image-to-Text Retrieval (MLSS). Specifically, we employ dual transformers to encode multimodal data and use a CNN to assist in extracting local information from images, thereby enhancing the matching capability between images and text. Additionally, we design a multi-label semantic sharing module based on a graph convolutional network, which learns a unified multi-label classifier and establishes a semantic bridge between the feature representation space and the hashing space for images and text. By leveraging multi-label semantic information to guide feature and hash learning, MLSS generates hash codes that preserve semantic similarity information, leading to a significant improvement in image-to-text retrieval performance. Our experiments on three benchmark datasets demonstrate that MLSS outperforms several state-of-the-art cross-modal retrieval methods. Our code is available at https://github.com/My1new/MLSS.
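To make the information loss mentioned above concrete, the following is a minimal sketch of generic sign-based hashing with Hamming-distance ranking. It is not the MLSS implementation: the function names (`sign_hash`, `hamming_distance`, `rank_by_hamming`) and the 4-dimensional feature values are hypothetical and chosen only for illustration.

```python
def sign_hash(features):
    """Binarize real-valued features into {-1, +1} codes (assumed convention: 0 maps to +1)."""
    return [1 if x >= 0 else -1 for x in features]


def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted by Hamming distance to the query, plus the distances."""
    distances = [hamming_distance(query_code, code) for code in database_codes]
    # sorted() is stable, so ties keep their original database order.
    order = sorted(range(len(database_codes)), key=lambda i: distances[i])
    return order, distances


# Hypothetical image query and text database embeddings (illustrative values only).
image_query = [0.9, -0.8, 0.7, -0.6]
text_database = [
    [0.8, -0.9, 0.6, -0.7],      # close to the query in feature space
    [0.01, -0.01, 0.01, -0.01],  # far smaller magnitude, but identical signs
    [-0.9, 0.8, -0.7, 0.6],      # opposite signs in every dimension
]

query_code = sign_hash(image_query)
database_codes = [sign_hash(row) for row in text_database]
order, distances = rank_by_hamming(query_code, database_codes)
```

In this toy setup the first two text items receive the same code as the query even though their real-valued features differ greatly in magnitude. That is the kind of information a plain sign mapping discards, and it is the gap that semantic guidance during hash learning is meant to narrow.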








Data availability
The primary data for this study have been derived from a publicly accessible repository, and the experimental data supporting the findings of this study can be obtained from the corresponding paper or supplementary materials.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Ma, Y., Wang, M., Lu, G. et al. Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval. Vis Comput 41, 1827–1840 (2025). https://doi.org/10.1007/s00371-024-03496-y
DOI: https://doi.org/10.1007/s00371-024-03496-y