
A novel deep translated attention hashing for cross-modal retrieval

Multimedia Tools and Applications

Abstract

In recent years, driven by the growing volume of cross-modal data such as images and texts, cross-modal retrieval has received intensive attention. Great progress has been made in deep cross-modal hash retrieval, which integrates feature learning and hash learning into an end-to-end trainable framework to obtain better hash codes. However, owing to the heterogeneity between images and texts, comparing their similarity remains a challenge. Most previous approaches embed images and texts into a joint embedding subspace independently and then compare their similarity, ignoring both the influence of irrelevant regions (image regions without a corresponding textual description) on cross-modal retrieval and the fine-grained interactions between images and texts. To address these issues, a new cross-modal hashing method called Deep Translated Attention Hashing for Cross-Modal Retrieval (DTAH) is proposed. First, DTAH extracts image and text features through bottom-up attention and a recurrent neural network, respectively, to reduce the influence of irrelevant regions on cross-modal retrieval. Then, with the help of a cross-modal attention module, DTAH captures the fine-grained interactions between vision and language at the region and word levels and embeds the text features into the image feature space. In this way, the proposed DTAH effectively narrows the heterogeneity gap between images and texts and learns discriminative hash codes. Extensive experiments on three benchmark datasets demonstrate that DTAH surpasses state-of-the-art methods.
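To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea: image region features (as would come from a bottom-up attention detector) attend over word features (as from a recurrent encoder), so the text is translated into the image feature space before a shared hashing layer. The module names, feature dimensions, scaled-dot-product scoring, and mean pooling are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class CrossModalAttentionHashing(nn.Module):
    """Sketch of a DTAH-style cross-modal attention hashing pipeline (assumed design)."""

    def __init__(self, region_dim=2048, word_dim=1024, joint_dim=512, hash_bits=64):
        super().__init__()
        self.img_proj = nn.Linear(region_dim, joint_dim)   # project bottom-up region features
        self.txt_proj = nn.Linear(word_dim, joint_dim)     # project recurrent word features
        self.hash_layer = nn.Linear(joint_dim, hash_bits)  # shared projection to hash codes

    def forward(self, regions, words):
        # regions: (B, R, region_dim) image region features, e.g. from a detector
        # words:   (B, W, word_dim) per-word features, e.g. from a GRU
        img = self.img_proj(regions)                       # (B, R, D)
        txt = self.txt_proj(words)                         # (B, W, D)

        # Region-word affinities: each image region attends over all words,
        # so the attended text is "translated" into the image feature space.
        attn = torch.softmax(img @ txt.transpose(1, 2) / img.size(-1) ** 0.5, dim=-1)  # (B, R, W)
        txt_translated = attn @ txt                        # (B, R, D)

        # Pool over regions and map to continuous codes; tanh keeps outputs
        # close to the binary {-1, +1} codes used at retrieval time.
        img_code = torch.tanh(self.hash_layer(img.mean(dim=1)))             # (B, hash_bits)
        txt_code = torch.tanh(self.hash_layer(txt_translated.mean(dim=1)))  # (B, hash_bits)
        return img_code, txt_code


# Toy usage: 2 samples, 36 regions, 20 words (shapes are hypothetical).
model = CrossModalAttentionHashing()
img_code, txt_code = model(torch.randn(2, 36, 2048), torch.randn(2, 20, 1024))
binary_img, binary_txt = img_code.sign(), txt_code.sign()
```

At query time the continuous codes would be binarized with sign(), as in the toy usage above, and images and texts ranked by Hamming distance between their binary codes.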



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 62020106011 and 61828105, and by the Chen Guang Project of the Shanghai Municipal Education Commission and the Shanghai Education Development Foundation under Grant No. 17CG41.

Author information


Corresponding author

Correspondence to Ran Ma.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yu, H., Ma, R., Su, M. et al. A novel deep translated attention hashing for cross-modal retrieval. Multimed Tools Appl 81, 26443–26461 (2022). https://doi.org/10.1007/s11042-022-12860-w

