Abstract
Cross-modal hashing, which embeds data into binary codes, is an efficient tool for retrieving heterogeneous but correlated multimedia data. In real applications, the query set is much larger than the training set and queries may be dissimilar to the training data, which exposes the limited generalization of deterministic models such as cross-encoders and autoencoders. In this paper, we design a variational cross-encoder (VCE), a generative model, to tackle this problem. At the bottleneck layer, the VCE outputs distributions parameterized by means and variances. Because the VCE can generate diversified data from injected noise, the proposed model generalizes better to test data. Ideally, each distribution describes a category of data, and samples drawn from it generate data of the same category; under this assumption, the means and variances can serve as real-valued codes for the input data. In practice, however, the generated data often do not belong to the same category as the input. We therefore add a penalty term on the variance output of the VCE and use the means as real-valued codes from which the hash codes are generated. Experiments on three widely used datasets validate the effectiveness of our method.
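To make the described bottleneck concrete, the sketch below shows a minimal variational cross-encoder layer with the reparameterization trick and a penalty that shrinks the predicted variances, so that the means can be binarized into hash codes. This is only an illustrative sketch: the layer sizes, the weight `lam`, and the exact form of the variance penalty are assumptions and not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VCEBottleneck(nn.Module):
    """Minimal sketch of a variational cross-encoder bottleneck.

    The encoder maps one modality (e.g., image features) to a mean and a
    log-variance, the reparameterization trick draws a noisy latent code,
    and the decoder reconstructs the other modality (e.g., text features).
    All dimensions are illustrative assumptions.
    """

    def __init__(self, in_dim=4096, code_dim=64, out_dim=1386):
        super().__init__()
        self.enc = nn.Linear(in_dim, 512)
        self.fc_mu = nn.Linear(512, code_dim)      # means -> real-valued codes
        self.fc_logvar = nn.Linear(512, code_dim)  # variances to be penalized
        self.dec = nn.Sequential(
            nn.Linear(code_dim, 512), nn.ReLU(), nn.Linear(512, out_dim)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar


def loss_fn(recon, target, mu, logvar, lam=0.1):
    """Reconstruction loss plus a term that narrows the predicted variances.

    The variance penalty used here (mean of exp(logvar)) is a hedged
    stand-in; the paper's exact penalty may differ.
    """
    recon_loss = F.mse_loss(recon, target)
    var_penalty = torch.exp(logvar).mean()
    return recon_loss + lam * var_penalty


# Hash codes are then obtained by binarizing the mean vectors:
# _, mu, _ = model(x); codes = torch.sign(mu)
```

In this reading, narrowing the variances keeps the samples drawn at the bottleneck close to the means, so the generated data stay in the input's category and the means remain reliable real-valued codes for hashing.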


Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62076204; in part by the Natural Science Foundation of Shaanxi Province under Grant 2020JQ-197.
Author information
Contributions
DT conceived the presented idea and performed the analytic calculations. DT, YC, YW, and DZ contributed to the design and implementation of the research, to the analysis of the results, and to the writing of the manuscript. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by Y. Zhang.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Tian, D., Cao, Y., Wei, Y. et al. Narrowing the variance of variational cross-encoder for cross-modal hashing. Multimedia Systems 29, 3421–3430 (2023). https://doi.org/10.1007/s00530-023-01161-3