Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval

Ma, Ying; Wang, Meng; Lu, Guangyun; Sun, Yajun

doi:10.1007/s00371-024-03496-y

Research
Published: 10 June 2024

The Visual Computer
Volume 41, pages 1827–1840 (2025)

Ying Ma¹, Meng Wang², Guangyun Lu³ & Yajun Sun¹

Abstract

Cross-modal hashing has attracted widespread attention because it reduces the cost of storage and retrieval. However, many existing methods use a sign function to generate hash codes, which loses semantic information when the original features are mapped to a low-dimensional space and consequently lowers retrieval accuracy. To address this problem, we propose a cross-modal hashing method called Multi-Label Semantic Sharing based on Graph Convolutional Network for Image-to-Text Retrieval (MLSS). Specifically, we employ dual transformers to encode multimodal data and use a CNN to help extract local information from images, thereby enhancing the matching capability between images and text. Additionally, we design a multi-label semantic sharing module based on a graph convolutional network, which learns a unified multi-label classifier and establishes a semantic bridge between the feature representation space and the hashing space for images and text. By using multi-label semantic information to guide feature and hash learning, MLSS generates hash codes that preserve semantic similarity information, leading to a significant improvement in image-to-text retrieval performance. Our experiments on three benchmark datasets demonstrate that MLSS outperforms several state-of-the-art cross-modal retrieval methods. Our code can be found at https://github.com/My1new/MLSS.
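
As a rough illustration of the hashing step the abstract criticizes, the sketch below binarizes real-valued features and ranks text items by Hamming distance to an image query. It assumes that the hash-mapping function mentioned in the abstract is the element-wise sign function. The function names and the toy feature values are hypothetical and are not taken from the MLSS code.

```python
import numpy as np


def sign_hash(features):
    """Binarize real-valued features into {-1, +1} codes (assumed sign mapping)."""
    return np.where(features >= 0, 1, -1).astype(np.int8)


def hamming_distance(query_code, database_codes):
    """Hamming distance between one code and every row of a code matrix."""
    n_bits = database_codes.shape[1]
    inner = database_codes.astype(np.int32) @ query_code.astype(np.int32)
    # For codes in {-1, +1}: distance = (n_bits - inner product) / 2
    return (n_bits - inner) // 2


def rank_by_hamming(query_code, database_codes):
    """Indices of database items, closest first."""
    return np.argsort(hamming_distance(query_code, database_codes), kind="stable")


# Toy 4-dimensional features (made up for illustration only).
image_feature = np.array([0.9, -0.2, 0.4, -0.7])
text_features = np.array([
    [0.8, -0.1, 0.3, -0.9],
    [-0.5, 0.6, 0.2, -0.3],
    [-0.4, 0.7, -0.6, 0.1],
])

query = sign_hash(image_feature)
database = sign_hash(text_features)
distances = hamming_distance(query, database)
ranking = rank_by_hamming(query, database)

print(query)      # 4-bit code of the image query
print(distances)  # Hamming distance to each text code
print(ranking)    # text indices, closest first
```

Note that the values 0.9 and 0.4 both become +1: the sign mapping discards magnitudes, which is one way to read the loss of semantic information that the abstract describes.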

Data availability

The primary data for this study have been derived from a publicly accessible repository, and the experimental data supporting the findings of this study can be obtained from the corresponding paper or supplementary materials.

Author information

Authors and Affiliations

College of Science, Guangxi University of Science and Technology, Liuzhou, 545000, China
Ying Ma & Yajun Sun
Tus College of Digit, Guangxi University of Science and Technology, Liuzhou, 545000, China
Meng Wang
College of Information Science and Engineering, Liuzhou Institute of Technology, Liuzhou, 545000, China
Guangyun Lu

Corresponding author

Correspondence to Meng Wang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ma, Y., Wang, M., Lu, G. et al. Multi-label semantic sharing based on graph convolutional network for image-to-text retrieval. Vis Comput 41, 1827–1840 (2025). https://doi.org/10.1007/s00371-024-03496-y

Accepted: 14 May 2024
Published: 10 June 2024
Issue Date: February 2025
DOI: https://doi.org/10.1007/s00371-024-03496-y
