
Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval

  • Research
  • Published in: The Visual Computer

Abstract

Cross-modal hashing has attracted widespread attention due to its ability to reduce the complexity of storage and retrieval. However, many existing methods rely on the sign function to generate hash codes, which loses semantic information when mapping the original features to a low-dimensional space and consequently decreases retrieval accuracy. To address these challenges, we propose a cross-modal hashing method called Multi-Label Semantic Sharing based on Graph Convolutional Network for Image-to-Text Retrieval (MLSS). Specifically, we employ dual transformers to encode multimodal data and use a CNN to assist in extracting local information from images, thereby enhancing the matching capability between images and text. Additionally, we design a multi-label semantic sharing module based on a graph convolutional network, which learns a unified multi-label classifier and establishes a semantic bridge between the feature representation space and the hashing space for images and text. By leveraging multi-label semantic information to guide feature and hash learning, MLSS generates hash codes that preserve semantic similarity, significantly improving image-to-text retrieval performance. Experiments on three benchmark datasets demonstrate that MLSS outperforms several state-of-the-art cross-modal retrieval methods. Our code is available at https://github.com/My1new/MLSS.
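
To make the pipeline described above concrete, the following is a minimal PyTorch sketch of a GCN-based multi-label semantic sharing module with a relaxed hashing head. All names and dimensions (GCNLayer, LabelSemanticSharing, word_dim, hash_bits) are illustrative assumptions rather than the authors' exact MLSS architecture; the sketch only shows the two ideas the abstract highlights: a GCN over a label co-occurrence graph that yields one classifier shared by both modalities, and a tanh relaxation standing in for the non-differentiable sign function during training.

```python
# Hypothetical sketch of the abstract's ideas; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution step H' = LeakyReLU(A_hat H W), Kipf-Welling style."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
        # a_hat: (L, L) normalized label co-occurrence adjacency; h: (L, in_dim)
        return F.leaky_relu(a_hat @ self.lin(h))


class LabelSemanticSharing(nn.Module):
    """A GCN turns label word embeddings into one multi-label classifier that
    is applied to both image and text features, bridging the two feature
    spaces; a shared projection produces relaxed (continuous) hash codes."""

    def __init__(self, word_dim=300, feat_dim=512, hash_bits=64):
        super().__init__()
        self.gcn1 = GCNLayer(word_dim, 256)
        self.gcn2 = GCNLayer(256, feat_dim)
        self.hash_proj = nn.Linear(feat_dim, hash_bits)

    def forward(self, label_emb, a_hat, img_feat, txt_feat):
        # Propagate label semantics over the graph -> classifier W: (L, feat_dim)
        w = self.gcn2(self.gcn1(label_emb, a_hat), a_hat)
        img_logits = img_feat @ w.t()   # multi-label scores for images
        txt_logits = txt_feat @ w.t()   # same shared classifier for texts
        # tanh keeps gradients flowing; binarize with torch.sign() at test time
        img_code = torch.tanh(self.hash_proj(img_feat))
        txt_code = torch.tanh(self.hash_proj(txt_feat))
        return img_logits, txt_logits, img_code, txt_code
```

In such a setup, the logits would typically be trained with a multi-label classification loss against ground-truth tags and the codes with a similarity-preserving loss; at retrieval time, torch.sign(...) binarizes the codes and ranking is done by Hamming distance.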


Data availability

The primary data for this study have been derived from a publicly accessible repository, and the experimental data supporting the findings of this study can be obtained from the corresponding paper or supplementary materials.


Author information


Corresponding author

Correspondence to Meng Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ma, Y., Wang, M., Lu, G. et al. Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval. Vis Comput 41, 1827–1840 (2025). https://doi.org/10.1007/s00371-024-03496-y
