Analyzing semantic correlation for cross-modal retrieval

  • Regular Paper
Multimedia Systems

Abstract

With the development of multimedia technology, the demand for effective cross-modal retrieval methods is growing. The key to cross-modal retrieval is analyzing the correlation between heterogeneous modalities. There are two main types of correlation: content correlation and semantic correlation. Semantic correlation is constructed at a high level of abstraction and is therefore closer to human understanding than content correlation. In this paper, we investigate a semantic model that constructs the semantic correlation for cross-modal retrieval. We assume that the semantic correlation of multimedia data from different modalities can be conditionally generated by semantic concepts in a probabilistic generation framework, and we propose the cross-modal semantic generation model (CMSGM) based on this assumption. We consider three cases of the cross-modal retrieval task. The first is the ideal case, in which all manifest concepts exist in the training data for constructing the correlation; for it we propose manifest CMSGM (M-CMSGM), which applies CMSGM directly to the manifest semantic concepts for retrieval. The second is the case in which the training data contain no manifest concepts; for it we propose latent CMSGM (L-CMSGM), which is based on latent semantic concepts learned by asymmetric spectral clustering. The last is the most general case, in which only some of the manifest concepts exist; we combine M-CMSGM and L-CMSGM into combinative CMSGM (C-CMSGM) to handle it. Experimental results on Wikipedia featured articles and MIR Flickr show that our methods outperform previous state-of-the-art methods. Moreover, C-CMSGM maintains good performance when manifest concepts are lacking, which confirms its robustness and practicality.
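The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea of linking two modalities through shared semantic concepts. It assumes each query and each candidate item has already been mapped to a probability distribution over the same set of concepts, and it scores a pair by summing, over concepts, the product of the two probabilities. The function names `concept_score` and `rank_items`, the three concepts, and all numbers are hypothetical illustrations and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def concept_score(query_post: Sequence[float], item_post: Sequence[float]) -> float:
    """Return sum over concepts c of p(c | query) * p(c | item).

    Assumption for illustration: both vectors are distributions over the
    same ordered list of semantic concepts.
    """
    if len(query_post) != len(item_post):
        raise ValueError("posterior vectors must have the same length")
    return sum(q * d for q, d in zip(query_post, item_post))


def rank_items(query_post: Sequence[float],
               items: Dict[str, Sequence[float]]) -> List[str]:
    """Order item names by decreasing concept-based score against the query."""
    return sorted(items,
                  key=lambda name: concept_score(query_post, items[name]),
                  reverse=True)


# Hypothetical posteriors over three concepts: (art, biology, sport).
text_query = [0.7, 0.2, 0.1]
images = {
    "img_painting": [0.8, 0.1, 0.1],
    "img_bird": [0.1, 0.8, 0.1],
    "img_match": [0.1, 0.1, 0.8],
}

ranking = rank_items(text_query, images)
print(ranking)
```

In this toy setting a text query that is mostly about the first concept ranks the image dominated by that concept first. Whether the concepts are manifest labels or latent clusters only changes how the distributions are obtained, not how the score in this sketch is computed.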



Author information

Corresponding author

Correspondence to Peng Pan.

Additional information

Communicated by L. Zhang.

About this article

Cite this article

Xie, L., Pan, P. & Lu, Y. Analyzing semantic correlation for cross-modal retrieval. Multimedia Systems 21, 525–539 (2015). https://doi.org/10.1007/s00530-014-0397-6

  • DOI: https://doi.org/10.1007/s00530-014-0397-6
