
Semantics-preserving hashing based on multi-scale fusion for cross-modal retrieval

Published in Multimedia Tools and Applications

Abstract

Research on hash-based cross-modal retrieval has been a hotspot in content-based multimedia retrieval. Most deep cross-modal hashing methods consider only the inter-modal loss, which preserves local information of the training data, and ignore the loss within samples of the same modality, which preserves the global information of the dataset. In addition, they ignore the fact that different scales of single-modal data carry different semantic information, which affects the representation of data features. In this paper, we propose a semantics-preserving hashing method based on multi-scale fusion. More concretely, a multi-scale fusion pooling model is proposed for both the image feature training network and the text feature training network, so that we can extract multi-scale features of the image data and alleviate the sparsity problem of text bag-of-words (BOW) vectors. When constructing the loss function, we consider intra-modal loss in addition to inter-modal loss, so the output hash codes retain both the global and the local underlying semantic correlations when the image and text feature training networks are trained. Experimental results on NUS-WIDE and MIRFlickr-25K show that our algorithm improves cross-modal retrieval accuracy compared with existing methods.
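
To make the two ideas in the abstract concrete, the sketch below shows one way a multi-scale fusion pooling block and a joint inter-/intra-modal pairwise loss could be written in PyTorch. It is a minimal illustration under our own assumptions, not the authors' implementation: the names (MultiScaleFusionPooling, joint_hash_loss), the pooling scales (1, 2, 4), and the weighting parameter eta are hypothetical, and the pairwise negative log-likelihood form follows common deep cross-modal hashing practice rather than equations taken from this paper.

```python
# Hypothetical sketch, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFusionPooling(nn.Module):
    """Pool a convolutional feature map at several scales and fuse the results."""

    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales

    def forward(self, feat):                      # feat: (B, C, H, W)
        pooled = [F.adaptive_avg_pool2d(feat, s).flatten(1) for s in self.scales]
        return torch.cat(pooled, dim=1)           # (B, C * sum(s * s))


def pairwise_logits(a, b):
    """Scaled inner products between two batches of relaxed hash codes."""
    return 0.5 * a @ b.t()


def similarity_loss(logits, sim):
    """Negative log-likelihood of pairwise similarity labels (sim in {0, 1})."""
    return (F.softplus(logits) - sim * logits).mean()


def joint_hash_loss(img_codes, txt_codes, sim, eta=1.0):
    """Inter-modal term plus intra-modal terms for both modalities."""
    inter = similarity_loss(pairwise_logits(img_codes, txt_codes), sim)
    intra_img = similarity_loss(pairwise_logits(img_codes, img_codes), sim)
    intra_txt = similarity_loss(pairwise_logits(txt_codes, txt_codes), sim)
    return inter + eta * (intra_img + intra_txt)


# Example usage with random stand-in data:
# feat = torch.randn(8, 512, 7, 7)
# fused = MultiScaleFusionPooling()(feat)          # (8, 512 * 21)
# img_c = torch.tanh(torch.randn(8, 64))           # relaxed 64-bit image codes
# txt_c = torch.tanh(torch.randn(8, 64))           # relaxed 64-bit text codes
# sim = torch.randint(0, 2, (8, 8)).float()        # pairwise similarity labels
# loss = joint_hash_loss(img_c, txt_c, sim)
```

The intra-modal terms are what tie same-modality samples together, so the learned codes reflect the global structure of the dataset rather than only the cross-modal pairings.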



Author information

Correspondence to Hong Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zhang, H., Pan, M. Semantics-preserving hashing based on multi-scale fusion for cross-modal retrieval. Multimed Tools Appl 80, 17299–17314 (2021). https://doi.org/10.1007/s11042-020-09869-4


