
Text-assisted attention-based cross-modal hashing

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

As one of the most active research topics in multimedia information retrieval, cross-modal hashing has drawn widespread attention over the past decades. Minimizing the semantic gap between heterogeneous data and accurately computing the similarity of cross-modal data are the key challenges of this task. A common paradigm is to map the features of multi-modal data into a common space. However, such approaches lack inter-modal information interaction and may not achieve satisfactory results. To overcome this limitation, we propose a novel text-assisted attention-based cross-modal hashing (TAACH) method. First, TAACH relies on LabelNet supervision to guide the learning of the hash function for each modality. In addition, a novel text-assisted attention mechanism densely integrates text features into image features, perceiving their spatial correlation and enhancing the consistency between image and text knowledge. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed TAACH, which achieves competitive performance compared with state-of-the-art methods. The source code is available at https://github.com/SWU-CS-MediaLab/TAACH.
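The full paper details the architecture; purely as an illustration, the sketch below shows one plausible PyTorch realization of the two ideas named in the abstract: a text feature acting as the attention query over spatial image features, followed by a tanh-relaxed hashing head. All module names, dimensions, the pooling step, and the residual fusion are our assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical sketch of text-assisted attention + a hashing head.
# Not the authors' code: shapes, fusion, and pooling are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextAssistedAttention(nn.Module):
    """Text-guided attention over spatial image features (illustrative)."""

    def __init__(self, img_dim: int, txt_dim: int, attn_dim: int = 256):
        super().__init__()
        self.q = nn.Linear(txt_dim, attn_dim)  # query projected from the text feature
        self.k = nn.Linear(img_dim, attn_dim)  # keys projected from image regions
        self.v = nn.Linear(img_dim, img_dim)   # values kept in the image feature space
        self.scale = attn_dim ** -0.5

    def forward(self, img_feats: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N, img_dim) -- N spatial regions of a CNN feature map
        # txt_feat:  (B, txt_dim)    -- one pooled text embedding per image-text pair
        q = self.q(txt_feat).unsqueeze(1)                             # (B, 1, attn_dim)
        k = self.k(img_feats)                                         # (B, N, attn_dim)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, 1, N)
        context = attn @ self.v(img_feats)                            # (B, 1, img_dim)
        # Residual fusion: broadcast the text-guided context over all regions,
        # so image features are enriched by text rather than replaced.
        return img_feats + context                                    # (B, N, img_dim)


class HashHead(nn.Module):
    """Maps fused features to continuous codes in (-1, 1); sign() binarizes at test time."""

    def __init__(self, in_dim: int, code_len: int = 64):
        super().__init__()
        self.fc = nn.Linear(in_dim, code_len)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        pooled = feats.mean(dim=1)           # pool the region axis: (B, in_dim)
        return torch.tanh(self.fc(pooled))   # tanh relaxation of binary codes for training


if __name__ == "__main__":
    attn = TextAssistedAttention(img_dim=512, txt_dim=300)
    head = HashHead(in_dim=512, code_len=64)
    imgs = torch.randn(4, 49, 512)  # e.g. a 7x7 conv map flattened to 49 regions
    txts = torch.randn(4, 300)      # e.g. bag-of-words text embeddings
    codes = head(attn(imgs, txts))
    print(codes.shape, codes.sign().shape)  # torch.Size([4, 64]) twice
```

The tanh relaxation is a standard trick in deep hashing: gradients flow through the continuous codes during training, and sign() yields the final binary codes for Hamming-distance retrieval.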


Data availability

The MIRFLICKR-25K dataset is available via [32], the NUS-WIDE dataset via [33], the Microsoft COCO 2014 dataset via [19], and the IAPR TC-12 dataset via [34].

References

  1. Peng Y, Huang X, Zhao Y (2018) An overview of cross-media retrieval: concepts, methodologies, benchmarks, and challenges. IEEE Trans Circuits Syst Video Technol 28(9):2372–2385

  2. Wang K, Yin Q, Wang W, Wu S, Wang L (2016) A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215

  3. Ding G, Guo Y, Zhou J, Gao Y (2016) Large-scale cross-modality search via collective matrix factorization hashing. IEEE Trans Image Process 25(11):5427–5440

  4. Ding K, Fan B, Huo C, Xiang S, Pan C (2016) Cross-modal hashing via rank-order preserving. IEEE Trans Multimed 19(3):571–585

  5. Gu J, Cai J, Joty SR, Niu L, Wang G (2018) Look, imagine and match: improving textual-visual cross-modal retrieval with generative models. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7181–7189

  6. Kumar S, Udupa R (2011) Learning hash functions for cross-view similarity search. In: Twenty-second international joint conference on artificial intelligence

  7. Song J, Yang Y, Yang Y, Huang Z, Shen HT (2013) Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data, pp 785–796

  8. Zhou J, Ding G, Guo Y (2014) Latent semantic sparse hashing for cross-modal similarity search. In: Proceedings of the 37th international ACM SIGIR conference on research & development in information retrieval, pp 415–424

  9. Bronstein MM, Bronstein AM, Michel F, Paragios N (2010) Data fusion through cross-modality metric learning using similarity-sensitive hashing. In: 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, pp 3594–3601

  10. Lin Z, Ding G, Hu M, Wang J (2015) Semantics-preserving hashing for cross-view retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3864–3872

  11. Wang D, Gao X, Wang X, He L (2015) Semantic topic multimodal hashing for cross-media retrieval. In: Twenty-fourth international joint conference on artificial intelligence

  12. Wu F, Yu Z, Yang Y, Tang S, Zhang Y, Zhuang Y (2013) Sparse multi-modal hashing. IEEE Trans Multimed 16(2):427–439

  13. Zhang D, Li WJ (2014) Large-scale supervised multimodal hashing with semantic correlation maximization. In: Proceedings of the AAAI conference on artificial intelligence, vol 28

  14. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828

  15. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90

  16. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

  17. Cao Y, Long M, Wang J, Yang Q, Yu PS (2016) Deep visual-semantic hashing for cross-modal retrieval. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 1445–1454

  18. Jiang QY, Li WJ (2017) Deep cross-modal hashing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3232–3240

  19. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: Computer vision–ECCV 2014: 13th European conference, Zurich, Switzerland, 6–12 Sept 2014, proceedings, part V. Springer, pp 740–755

  20. Shen Y, Liu L, Shao L, Song J (2017) Deep binaries: encoding semantic-rich cues for efficient textual-visual cross retrieval. In: Proceedings of the IEEE international conference on computer vision, pp 4097–4106

  21. Yang E, Deng C, Liu W, Liu X, Tao D, Gao X (2017) Pairwise relationship guided deep hashing for cross-modal retrieval. In: Proceedings of the AAAI conference on artificial intelligence, vol 31

  22. Li C, Deng C, Li N, Liu W, Gao X, Tao D (2018) Self-supervised adversarial hashing networks for cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4242–4251

  23. Ma X, Zhang T, Xu C (2020) Multi-level correlation adversarial hashing for cross-modal retrieval. IEEE Trans Multimed 22(12):3101–3114

  24. Wang J, Zhang T, Sebe N, Shen HT (2017) A survey on learning to hash. IEEE Trans Pattern Anal Mach Intell 40(4):769–790

  25. Wang X, Zou X, Bakker EM, Song W (2020) Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval. Neurocomputing 400:255–271

  26. Zou X, Wu S, Zhang N, Bakker EM (2022) Multi-label modality enhanced attention based self-supervised deep cross-modal hashing. Knowl Based Syst 239:107927

  27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Adv Neural Inf Process Syst 30

  28. Yang Z, Yang D, Dyer C, He X, Smola A, Hovy E (2016) Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp 1480–1489

  29. Zhang X, Lai H, Feng J (2018) Attention-aware deep adversarial hashing for cross-modal retrieval. In: Proceedings of the European conference on computer vision (ECCV), pp 591–606

  30. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: International conference on machine learning

  31. Cao Y, Long M, Wang J, Yu PS (2016) Correlation hashing network for efficient cross-modal retrieval. arXiv preprint arXiv:1602.06697

  32. Huiskes MJ, Lew MS (2008) The MIR Flickr retrieval evaluation. In: Proceedings of the 1st ACM international conference on multimedia information retrieval, pp 39–43

  33. Chua TS, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM international conference on image and video retrieval, pp 1–9

  34. Escalante HJ, Hernández CA, Gonzalez JA, López-López A, Montes M, Morales EF, Enrique Sucar L, Villasenor L, Grubinger M (2010) The segmented and annotated IAPR TC-12 benchmark. Comput Vis Image Underst 114(4):419–428

  35. Mandal D, Chaudhury KN, Biswas S (2017) Generalized semantic preserving hashing for n-label cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4076–4084

  36. Zou X, Wang X, Bakker EM, Wu S (2021) Multi-label semantics preserving based deep cross-modal hashing. Signal Process Image Commun 93:116131

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities, China (SWU-KT22032).

Author information

Contributions

Xiang Yuan contributed to the conception of the research, software, investigation, implementation of the experiments, and writing of the manuscript. Shihao Shan contributed to the methodology and software. Yuwen Huo contributed to the revision and editing of the manuscript. Junkai Jiang contributed to the methodology and software. Song Wu contributed to the research conception, methodology, and software, and wrote, reviewed, and edited the manuscript.

Corresponding author

Correspondence to Song Wu.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests pertinent to the subject matter of this study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yuan, X., Shan, S., Huo, Y. et al. Text-assisted attention-based cross-modal hashing. Int J Multimed Info Retr 13, 3 (2024). https://doi.org/10.1007/s13735-023-00311-7
