Abstract
Multi-modal deep hashing is one of the most widely used unsupervised approaches to cross-modal retrieval. Most existing deep hashing methods focus on preserving similarity information during hash-code learning. Although they learn accurate and compact binary representations, these methods fail to encourage discriminative feature learning. In this paper, we propose a new method, Class Concentration with Twin Variational autoencoders (CCTV), to learn discriminative hash codes. The novelty of CCTV lies in two aspects. First, the method concentrates the mean vectors of the latent features: under the assumption that the features in the shared latent space follow a multivariate Gaussian distribution, CCTV jointly updates the mean vectors and the cluster centroids of the latent features by minimizing a class-concentration loss, which narrows the distance between the centroids and the mean vectors and thus yields more compact clusters. Second, unlike previous work, CCTV uses the mean vectors of the latent features, constrained by the raw similarity information, as the image representations, reducing the influence of the variance, and then embeds them into Hamming space. Our experimental evaluation on four multimedia benchmarks shows a significant improvement over state-of-the-art methods. Code is available at: https://github.com/theusernamealreadyexists/CCTV.
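As a rough illustration of the class-concentration idea described above — pulling each latent mean vector toward its assigned cluster centroid — the loss can be sketched as follows. This is a minimal sketch under assumed conventions, not the authors' implementation; all names are illustrative.

```python
import numpy as np

def class_concentration_loss(mu, centroids, assignments):
    """Mean squared distance between each latent mean vector mu[i]
    and its assigned cluster centroid (illustrative sketch only).

    mu:          (N, d) array of latent mean vectors
    centroids:   (K, d) array of cluster centroids
    assignments: (N,)   array of cluster indices in [0, K)
    """
    diffs = mu - centroids[assignments]        # (N, d) residuals
    return np.mean(np.sum(diffs ** 2, axis=1))

# Toy example: two clusters in a 2-D latent space.
mu = np.array([[0.9, 0.1], [1.1, -0.1], [-1.0, 0.0]])
centroids = np.array([[1.0, 0.0], [-1.0, 0.0]])
assignments = np.array([0, 0, 1])
loss = class_concentration_loss(mu, centroids, assignments)
```

Minimizing this quantity with respect to both `mu` and `centroids` (e.g. by gradient descent, alongside the VAE and similarity objectives) would tighten each cluster around its centroid, which is the concentration effect the abstract describes.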
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61872187, No. 62077023 and No. 62072246, in part by the Natural Science Foundation of Jiangsu Province under Grant No. BK20201306, and in part by the “111” Program under Grant No. B13022.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhao, Y., Zhu, Y., Liao, S., Ye, Q., Zhang, H. (2023). Class Concentration with Twin Variational Autoencoders for Unsupervised Cross-Modal Hashing. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13846. Springer, Cham. https://doi.org/10.1007/978-3-031-26351-4_15
DOI: https://doi.org/10.1007/978-3-031-26351-4_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26350-7
Online ISBN: 978-3-031-26351-4