Abstract
A great many approaches have been developed for cross-modal retrieval, among which subspace learning based ones dominate the landscape. Depending on whether semantic label information is used, subspace learning based approaches fall into two paradigms: unsupervised and supervised. For multi-label cross-modal retrieval, however, supervised approaches simply exploit multi-label information to learn a discriminative subspace, without considering the correlations between the multiple labels shared across modalities, which often leads to unsatisfactory retrieval performance. To address this issue, we propose a general framework that jointly incorporates semantic correlations into subspace learning for multi-label cross-modal retrieval. By introducing a regularization term based on the Hilbert-Schmidt Independence Criterion (HSIC), not only is the correlation information among multiple labels leveraged, but the consistency of the similarity structure across modalities is also well preserved. In addition, based on the semantic-consistency projection, the semantic gap between the low-level feature space of each modality and the shared high-level semantic space is bridged by a mid-level consistent space, in which multi-label cross-modal retrieval can be performed effectively and efficiently. To solve the optimization problem, we design an effective iterative algorithm and analyze its convergence both theoretically and experimentally. Experimental results on real-world datasets show the superiority of the proposed method over several existing cross-modal subspace learning methods.
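As a concrete illustration of the dependence measure named in the abstract, the following minimal sketch computes the standard biased empirical HSIC estimate, tr(KHLH)/(n-1)^2, between a feature matrix and a multi-label matrix. It is not the regularization term of the proposed framework, whose exact form is not given here. The linear kernel, the toy data, and the names `linear_kernel` and `empirical_hsic` are assumptions made for illustration only.

```python
import numpy as np


def linear_kernel(X):
    """Gram matrix of a linear kernel; X has shape (n_samples, n_features)."""
    return X @ X.T


def empirical_hsic(K, L):
    """Biased empirical HSIC estimate: tr(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2


# Toy data (assumed, not from the paper): 200 samples, 5 labels, 20-dim features.
rng = np.random.default_rng(0)
n = 200
labels = (rng.random((n, 5)) < 0.3).astype(float)  # toy multi-label matrix
# Features that depend on the labels, plus a small amount of noise.
features = labels @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(n, 20))
# Features drawn independently of the labels.
noise = rng.normal(size=(n, 20))

hsic_dependent = empirical_hsic(linear_kernel(features), linear_kernel(labels))
hsic_independent = empirical_hsic(linear_kernel(noise), linear_kernel(labels))
print(hsic_dependent, hsic_independent)
```

In this toy setting the label-dependent features yield a clearly larger HSIC value than the independent ones, which is the property that makes HSIC usable as a regularizer for tying a learned subspace to label correlations.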






Acknowledgements
This work was jointly supported by the National Natural Science Foundation of China (No. 61572068, No. 61532005), the National Key Research and Development of China (No. 2016YFB0800404), and the Fundamental Research Funds for the Central Universities (No. 2018JBZ001).
Cite this article
Xu, M., Zhu, Z. & Zhao, Y. Towards learning a semantic-consistent subspace for cross-modal retrieval. Multimed Tools Appl 78, 389–412 (2019). https://doi.org/10.1007/s11042-018-6578-0
DOI: https://doi.org/10.1007/s11042-018-6578-0