
Two-stage deep learning for supervised cross-modal retrieval

Published in: Multimedia Tools and Applications

Abstract

This paper addresses the problem of modeling internet images and their associated texts for cross-modal retrieval, i.e., text-to-image and image-to-text retrieval. Supervised cross-modal retrieval has recently attracted increasing attention. Inspired by a typical two-stage method, semantic correlation matching (SCM), we propose a novel two-stage deep learning method for supervised cross-modal retrieval. Because traditional canonical correlation analysis (CCA) is a 2-view method, SCM can exploit the supervised semantic information only in its second stage. To make full use of the semantics, we extend CCA from 2 views to 3 views and conduct supervised learning in both stages. In the first stage, we embed 3-view CCA into a deep architecture to learn non-linear correlations among images, texts, and semantics. To alleviate over-fitting, we add the reconstruction loss of each view to the loss function, which also includes the correlation loss between every pair of views and a regularization term on the parameters. In the second stage, we build a novel fully convolutional network (FCN), trained under the joint supervision of a contrastive loss and a center loss to learn better features. The proposed method is evaluated on two publicly available data sets, and the experimental results show that it is competitive with state-of-the-art methods.
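The stage-2 joint supervision can be made concrete with a minimal sketch. The snippet below assumes the standard formulations of the contrastive loss (Hadsell et al., 2006) and the center loss (Wen et al., 2016) referenced in the abstract; the weighting factor `lam` and the margin value are hypothetical hyperparameters, not values given by the paper.

```python
import numpy as np

def contrastive_loss(f1, f2, same, margin=1.0):
    """Contrastive loss: pull matched cross-modal pairs together,
    push mismatched pairs at least `margin` apart (Hadsell et al., 2006)."""
    d = np.linalg.norm(f1 - f2, axis=1)                 # Euclidean distance per pair
    pos = same * d ** 2                                 # matched pairs: squared distance
    neg = (1 - same) * np.maximum(margin - d, 0) ** 2   # mismatched: hinge on margin
    return 0.5 * np.mean(pos + neg)

def center_loss(feats, labels, centers):
    """Center loss: squared distance of each feature to its class
    center, encouraging compact intra-class clusters (Wen et al., 2016)."""
    return 0.5 * np.mean(np.sum((feats - centers[labels]) ** 2, axis=1))

def joint_loss(f1, f2, same, labels1, labels2, centers, lam=0.1, margin=1.0):
    """Joint supervision: contrastive loss plus lam-weighted center loss
    over the features of both modalities (lam is an assumed hyperparameter)."""
    lc = contrastive_loss(f1, f2, same, margin)
    lcen = center_loss(np.vstack([f1, f2]),
                       np.concatenate([labels1, labels2]), centers)
    return lc + lam * lcen
```

In a sketch like this, the contrastive term aligns the two modalities pairwise while the center term shapes each class into a tight cluster, which is the complementary effect the joint supervision is meant to achieve.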





Acknowledgments

This work is supported by the Chinese National Natural Science Foundation (61532018, 61471049) and the Key Laboratory of Forensic Marks, Ministry of Public Security of China.

Corresponding author

Correspondence to Zhicheng Zhao.


The first two authors contributed equally to this work.


Cite this article

Shao, J., Zhao, Z. & Su, F. Two-stage deep learning for supervised cross-modal retrieval. Multimed Tools Appl 78, 16615–16631 (2019). https://doi.org/10.1007/s11042-018-7068-0
