Abstract
Cross-modal retrieval has attracted considerable attention in the era of the multimedia data explosion. Owing to their low storage cost and fast retrieval speed, hash learning-based methods have become increasingly popular in this field. Cross-modal retrieval faces two crucial bottlenecks: the heterogeneous gap between different modalities and the semantic gap among similar data from various modalities. To bridge the heterogeneous gap, we adopt a self-supervised approach that generates cohesive features for different instances. To mitigate the semantic gap, we use triplet sampling to optimize both the inter-modal and intra-modal semantic losses, which increases the discriminability of our approach. Experiments on two benchmark datasets show the efficiency and robustness of our method, and extended experiments show its scalability.
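As a rough illustration of the triplet-based semantic loss mentioned above, the following sketch applies a generic triplet margin loss to real-valued (relaxed) hash codes and sums it over inter-modal and intra-modal triplets. The abstract does not give the exact formulation, so the function names, the squared Euclidean distance, the margin value, and the example codes are all assumptions made for illustration only.

```python
# Illustrative sketch only: a generic triplet margin loss on relaxed hash codes.
# The distance, margin, and all names here are assumptions, not the paper's method.

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length code vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    return max(0.0, margin + squared_distance(anchor, positive)
               - squared_distance(anchor, negative))


def total_semantic_loss(inter_triplets, intra_triplets, margin=1.0):
    """Sum the triplet loss over inter-modal and intra-modal triplets."""
    inter = sum(triplet_loss(a, p, n, margin) for a, p, n in inter_triplets)
    intra = sum(triplet_loss(a, p, n, margin) for a, p, n in intra_triplets)
    return inter + intra


# Hypothetical 4-bit relaxed codes for one image anchor.
image_anchor = [1.0, -1.0, 1.0, 1.0]
text_positive = [1.0, -1.0, 1.0, -1.0]   # text assumed to share a label with the anchor
text_negative = [-1.0, 1.0, -1.0, -1.0]  # text assumed to share no label with the anchor
image_positive = [1.0, -1.0, 1.0, 1.0]   # image assumed to share a label with the anchor
image_negative = [1.0, 1.0, 1.0, 1.0]    # image assumed to share no label with the anchor

inter = [(image_anchor, text_positive, text_negative)]   # anchor and samples from different modalities
intra = [(image_anchor, image_positive, image_negative)]  # anchor and samples from the same modality
loss = total_semantic_loss(inter, intra, margin=8.0)
```

In this toy case the inter-modal triplet already satisfies the margin and contributes nothing, while the intra-modal negative lies too close to the anchor and is penalized, which is the behavior a triplet objective is meant to encourage.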






Cite this article
Li, Y., Wang, X., Cui, L. et al. Autoencoder-based self-supervised hashing for cross-modal retrieval. Multimed Tools Appl 80, 17257–17274 (2021). https://doi.org/10.1007/s11042-020-09599-7
DOI: https://doi.org/10.1007/s11042-020-09599-7