Abstract
With the development of multimedia technology, the demand for effective cross-modal retrieval methods is growing. The key to cross-modal retrieval is modeling the correlation between heterogeneous modalities. There are two main types of correlation: content correlation and semantic correlation. Semantic correlation is constructed at a high level of abstraction and is therefore closer to human understanding than content correlation. In this paper, we investigate a semantic model that constructs semantic correlation for cross-modal retrieval. We assume that the semantic correlation of multimedia data from different modalities can be conditionally generated by semantic concepts within a probabilistic generation framework, and we propose the cross-modal semantic generation model (CMSGM) based on this assumption. We consider three cases of the cross-modal retrieval task. The first is the ideal case in which all manifest concepts exist in the training data for constructing the correlation; for this case we propose manifest CMSGM (M-CMSGM), which applies CMSGM directly to the manifest semantic concepts for retrieval. The second is the case in which no manifest concepts exist in the training data; for this case we propose latent CMSGM (L-CMSGM), which is based on latent semantic concepts learned by asymmetric spectral clustering. The last is the most general case, in which only some of the manifest concepts exist; here we combine M-CMSGM and L-CMSGM into combinative CMSGM (C-CMSGM). Experimental results on Wikipedia featured articles and MIR Flickr show that our methods outperform previous state-of-the-art methods. Moreover, C-CMSGM maintains good performance when manifest concepts are lacking, which confirms its robustness and practicality.
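As a rough illustration of the conditional-generation assumption above, the sketch below scores cross-modal similarity by marginalizing over semantic concepts: score(x, y) = Σ_c P(c | x) · P(y | c). All names here (`cross_modal_score`, `posterior`, `likelihoods`) are illustrative placeholders rather than the paper's notation, and in practice the probabilities would come from the learned generation model, not be given by hand.

```python
def cross_modal_score(p_c_given_x, p_y_given_c):
    """Score N candidate items from the target modality against one query.

    p_c_given_x: list of K floats, the concept posterior for the query
                 (e.g. produced by a classifier on the query's features).
    p_y_given_c: N lists of K floats, the likelihood of each candidate
                 under each of the K semantic concepts.
    Returns a list of N scores; candidates are ranked by descending score.
    """
    return [sum(p * q for p, q in zip(row, p_c_given_x))
            for row in p_y_given_c]

# Toy example: K = 3 concepts, N = 2 candidate texts for an image query.
posterior = [0.7, 0.2, 0.1]          # query strongly suggests concept 0
likelihoods = [[0.9, 0.05, 0.05],    # candidate 0 fits concept 0
               [0.1, 0.6, 0.3]]      # candidate 1 fits concept 1
scores = cross_modal_score(posterior, likelihoods)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

Here candidate 0 ranks first, since its concept profile matches the query's posterior; the latent and combinative variants differ only in where the concepts come from, not in this scoring step.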
Communicated by L. Zhang.
Xie, L., Pan, P. & Lu, Y. Analyzing semantic correlation for cross-modal retrieval. Multimedia Systems 21, 525–539 (2015). https://doi.org/10.1007/s00530-014-0397-6