Towards learning a semantic-consistent subspace for cross-modal retrieval

Multimedia Tools and Applications

Abstract

A great many approaches have been developed for cross-modal retrieval, among which subspace learning based ones dominate the landscape. Depending on whether semantic label information is used, subspace learning based approaches fall into two paradigms: unsupervised and supervised. For multi-label cross-modal retrieval, however, supervised approaches simply exploit multi-label information to learn a discriminative subspace, without considering the correlations between the multiple labels shared across modalities, which often leads to unsatisfactory retrieval performance. To address this issue, we propose a general framework that jointly incorporates semantic correlations into subspace learning for multi-label cross-modal retrieval. By introducing an HSIC-based regularization term, the framework not only leverages the correlation information among multiple labels but also preserves the consistency between the similarity structures of the individual modalities. In addition, based on the semantic-consistency projection, the semantic gap between the low-level feature space of each modality and the shared high-level semantic space is balanced by a mid-level consistent space, in which multi-label cross-modal retrieval can be performed effectively and efficiently. To solve the optimization problem, we design an effective iterative algorithm and analyze its convergence both theoretically and experimentally. Experimental results on real-world datasets show the superiority of the proposed method over several existing cross-modal subspace learning methods.
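
To make the HSIC-based term more concrete, the sketch below computes the standard biased empirical estimate of the Hilbert-Schmidt Independence Criterion, trace(KHLH) / (n - 1)^2, between a feature kernel and a label kernel. It is only an illustration of the dependence measure, not the regularizer or objective used in the article. The linear kernels, the synthetic data, and the names `linear_kernel` and `empirical_hsic` are all assumptions made for this example.

```python
import numpy as np


def linear_kernel(X):
    """Gram matrix of a linear kernel; rows of X are samples (assumed kernel choice)."""
    return X @ X.T


def empirical_hsic(K, L):
    """Biased empirical HSIC between two n x n Gram matrices: trace(K H L H) / (n - 1)^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2


# Synthetic, hypothetical data: 50 samples, 5 binary labels, 8-dimensional features.
rng = np.random.default_rng(0)
n = 50
Y = (rng.random((n, 5)) < 0.3).astype(float)  # multi-label indicator matrix

# Features that depend on the labels, plus a little noise.
X_dependent = Y @ rng.normal(size=(5, 8)) + 0.1 * rng.normal(size=(n, 8))
# Features drawn independently of the labels.
X_independent = rng.normal(size=(n, 8))

hsic_dependent = empirical_hsic(linear_kernel(X_dependent), linear_kernel(Y))
hsic_independent = empirical_hsic(linear_kernel(X_independent), linear_kernel(Y))

print("HSIC(label-dependent features, labels):", hsic_dependent)
print("HSIC(independent features, labels):    ", hsic_independent)
```

With linear kernels the estimate reduces to the squared Frobenius norm of the empirical cross-covariance between features and labels, so a larger value indicates stronger linear dependence between the two views.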

Acknowledgements

This work was jointly supported by the National Natural Science Foundation of China (Nos. 61572068 and 61532005), the National Key Research and Development of China (No. 2016YFB0800404), and the Fundamental Research Funds for the Central Universities (No. 2018JBZ001).

Author information

Corresponding author

Correspondence to Zhenfeng Zhu.

About this article

Cite this article

Xu, M., Zhu, Z. & Zhao, Y. Towards learning a semantic-consistent subspace for cross-modal retrieval. Multimed Tools Appl 78, 389–412 (2019). https://doi.org/10.1007/s11042-018-6578-0
