Abstract
Due to the rapid development of multimedia applications, cross-media semantics learning is becoming increasingly important. One of the most challenging issues in cross-media semantics understanding is how to mine the semantic correlation between different modalities. Most traditional multimedia semantics analysis approaches are based on unimodal data and neglect the semantic consistency between different modalities. In this paper, we propose a novel multimedia representation learning framework via latent semantic factorization (LSF). First, the posterior probability under the learned classifiers serves as the latent semantic representation for each modality. Second, we derive the semantic representation of a multimedia document, which consists of an image and text, by latent semantic factorization. Third, two projection matrices are learned to project images and text into the same semantic space, in which they are more similar to the multimedia document. Experiments on three real-world datasets for cross-media retrieval demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods.
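The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the classifier, the factorization objective, or how the projections are fitted, so the softmax regression, the alternating least squares factorization with a shared factor, and the ridge-regression projections below are all assumptions, as are every function name and the synthetic data.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_classifier(X, y, n_classes, lr=0.5, epochs=200):
    # Assumed classifier: multinomial logistic regression by gradient descent.
    n, d = X.shape
    W = np.zeros((d, n_classes))
    Y = np.eye(n_classes)[y]  # one-hot labels
    for _ in range(epochs):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / n
    return W

def joint_factorize(P_img, P_txt, k, iters=50, reg=1e-3, seed=0):
    # Assumed objective: P_img ~ S @ A and P_txt ~ S @ B with a shared S,
    # solved by regularized alternating least squares.
    rng = np.random.default_rng(seed)
    P = np.hstack([P_img, P_txt])            # n x 2c
    S = rng.standard_normal((P.shape[0], k))  # shared document factor
    I = reg * np.eye(k)
    for _ in range(iters):
        AB = np.linalg.solve(S.T @ S + I, S.T @ P)      # update [A, B], k x 2c
        S = np.linalg.solve(AB @ AB.T + I, AB @ P.T).T  # update S, n x k
    return S, AB

def fit_projection(P, S, reg=1e-3):
    # Assumed projection: ridge regression minimizing ||P W - S||^2 + reg ||W||^2.
    c = P.shape[1]
    return np.linalg.solve(P.T @ P + reg * np.eye(c), P.T @ S)

# Synthetic paired image/text features (hypothetical data, 3 classes).
rng = np.random.default_rng(0)
n, c = 60, 3
y = rng.integers(0, c, size=n)
X_img = rng.standard_normal((c, 8))[y] + 0.1 * rng.standard_normal((n, 8))
X_txt = rng.standard_normal((c, 5))[y] + 0.1 * rng.standard_normal((n, 5))

# Step 1: class posteriors as the latent semantic representation per modality.
P_img = softmax(X_img @ fit_softmax_classifier(X_img, y, c))
P_txt = softmax(X_txt @ fit_softmax_classifier(X_txt, y, c))

# Step 2: shared document representation from a joint factorization.
S, _ = joint_factorize(P_img, P_txt, k=c)

# Step 3: one projection matrix per modality into the document space.
W_img = fit_projection(P_img, S)
W_txt = fit_projection(P_txt, S)
Z_img, Z_txt = P_img @ W_img, P_txt @ W_txt
```

Under these assumptions, cross-media retrieval would rank items of one modality by their similarity to a query from the other modality in the projected space, for example by comparing rows of `Z_img` and `Z_txt`.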




Acknowledgements
This research is supported by the National Natural Science Foundation of China (No. 61373109, No. 61602349), the Hubei Chengguang Talented Youth Development Foundation (No. 2015B22), the Natural Science Foundation of Hubei Province (No. ZRMS2016000155), and the Science and Technology Research Project of the Hubei Provincial Department of Education (No. Q20161113).
Cite this article
Zhang, H., Huang, Y., Xu, X. et al. Latent semantic factorization for multimedia representation learning. Multimed Tools Appl 77, 3353–3368 (2018). https://doi.org/10.1007/s11042-017-5135-6
DOI: https://doi.org/10.1007/s11042-017-5135-6