
Autoencoder-based self-supervised hashing for cross-modal retrieval

Published in Multimedia Tools and Applications

Abstract

Cross-modal retrieval has gained considerable attention in the era of the multimedia data explosion. Owing to their low storage cost and fast retrieval speed, hash learning-based methods have become increasingly popular in this field. The crucial bottlenecks of cross-modal retrieval are twofold: the heterogeneous gap between different modalities and the semantic gap among similar data from various modalities. To address these issues, we adopt a self-supervised scheme that bridges the heterogeneous gap by generating cohesive features for instances of different modalities. To mitigate the semantic gap, we use triplet sampling to optimize an inter-modal and intra-modal semantic loss, which increases the discriminability of our approach. Experiments on two benchmark datasets show the efficiency and robustness of our method, and extended experiments show its scalability.
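The combined inter-modal and intra-modal triplet objective mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the margin value, the index-based triplet sampling interface, and the use of a standard triplet margin loss over continuous (pre-binarization) hash codes are all assumptions made for the example.

```python
# Minimal sketch (assumed, not the paper's code) of a combined inter-modal and
# intra-modal triplet loss over continuous hash codes, as outlined in the abstract.
import torch
import torch.nn.functional as F


def triplet_term(anchor, positive, negative, margin=0.5):
    """Standard triplet margin loss: pull the positive closer to the anchor than the negative."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()


def semantic_loss(img_codes, txt_codes, pos_idx, neg_idx, margin=0.5):
    """img_codes, txt_codes: (batch, code_len) continuous codes before binarization.
    pos_idx, neg_idx: indices of semantically similar / dissimilar samples,
    chosen by a triplet-sampling step (hypothetical, not shown here)."""
    # Inter-modal terms: image anchors against text positives/negatives, and vice versa.
    inter = (triplet_term(img_codes, txt_codes[pos_idx], txt_codes[neg_idx], margin) +
             triplet_term(txt_codes, img_codes[pos_idx], img_codes[neg_idx], margin))
    # Intra-modal terms: anchors compared against samples from their own modality.
    intra = (triplet_term(img_codes, img_codes[pos_idx], img_codes[neg_idx], margin) +
             triplet_term(txt_codes, txt_codes[pos_idx], txt_codes[neg_idx], margin))
    return inter + intra
```

In such a setup, binary hash codes would typically be obtained from the continuous codes by thresholding (e.g., taking the sign) after training; this step is omitted from the sketch.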



Author information

Corresponding author

Correspondence to Shuhan Qi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, Y., Wang, X., Cui, L. et al. Autoencoder-based self-supervised hashing for cross-modal retrieval. Multimed Tools Appl 80, 17257–17274 (2021). https://doi.org/10.1007/s11042-020-09599-7


