Semi-supervised cross-modal learning for cross modal retrieval and image annotation

World Wide Web

Abstract

Multimedia data are usually associated with multiple modalities represented by heterogeneous features. Many information retrieval tasks are no longer restricted to a single modality, and content-based cross-modal retrieval has become a popular research field. The premise of cross-modal retrieval is discovering the relationships between different modalities efficiently. Although some approaches have been proposed to address this challenging problem, they either ignore the valuable labels or depend heavily on completely labeled training data. In addition, for features with relatively high dimensionality, it is important to select the most informative ones. In this paper, we propose a semi-supervised algorithm for cross-modal learning. Our algorithm makes full use of both a small amount of labeled data and abundant unlabeled data to establish connections between modalities by discovering a shared semantic space. In addition, our algorithm automatically filters out noisy and redundant features to further improve the model. Finally, we give an efficient solution to the objective function. Experiments on two publicly available datasets demonstrate that the proposed method is competitive with or even superior to state-of-the-art counterparts.



Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grant Nos. 61672254 and 61300222, the Key Project of the National Natural Science Foundation of China under Grant No. U1536203, the Natural Science Foundation of Hubei Province under Grant No. 2015CFB687, and the Fundamental Research Funds for the Central Universities, HUST: 2016YXMS088. The authors appreciate the valuable suggestions from the anonymous reviewers and the editors.

Author information

Corresponding author

Correspondence to Fuhao Zou.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications

Guest Editors: Jingkuan Song, Shuqiang Jiang, Elisa Ricci, and Zi Huang

About this article

Cite this article

Zou, F., Bai, X., Luan, C. et al. Semi-supervised cross-modal learning for cross modal retrieval and image annotation. World Wide Web 22, 825–841 (2019). https://doi.org/10.1007/s11280-018-0581-2

  • DOI: https://doi.org/10.1007/s11280-018-0581-2
