Latent semantic factorization for multimedia representation learning

Multimedia Tools and Applications

Abstract

With the rapid development of multimedia applications, cross-media semantics learning is becoming increasingly important. One of the most challenging issues in cross-media semantics understanding is how to mine the semantic correlation between different modalities. Most traditional multimedia semantics analysis approaches handle unimodal data only and neglect the semantic consistency between modalities. In this paper, we propose a novel multimedia representation learning framework based on latent semantic factorization (LSF). First, the posterior probabilities produced by the learned classifiers serve as the latent semantic representation of each modality. Second, we derive the semantic representation of a multimedia document, which consists of an image and text, by latent semantic factorization. Finally, two projection matrices are learned to project images and text into a common semantic space that more closely resembles the representation of the multimedia document. Experiments on three real-world datasets for cross-media retrieval demonstrate the effectiveness of the proposed approach compared with state-of-the-art methods.
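
The three stages named in the abstract can be illustrated with a rough sketch. The abstract does not give the classifiers, the factorization objective, or the solver, so everything below is an assumption made for illustration: a ridge-plus-softmax stand-in replaces the learned classifiers, a truncated SVD of the stacked posteriors replaces the paper's latent semantic factorization, and closed-form ridge regression is used to fit the two projection matrices. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax with max-subtraction for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def fit_ridge(X, Y, lam=1e-2):
    # Closed-form ridge regression: argmin_W ||XW - Y||^2 + lam * ||W||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def semantic_posteriors(X, labels, n_classes):
    # ASSUMPTION: stand-in classifier (ridge onto one-hot labels, then softmax).
    # The paper's actual classifiers are not specified in the abstract.
    W = fit_ridge(X, np.eye(n_classes)[labels])
    return softmax(X @ W)

def shared_document_representation(P_img, P_txt, k):
    # ASSUMPTION: truncated SVD of the stacked posteriors as a simple
    # substitute for the paper's latent semantic factorization.
    U, s, _ = np.linalg.svd(np.hstack([P_img, P_txt]), full_matrices=False)
    return U[:, :k] * s[:k]

# Synthetic paired image/text features sharing class labels (illustrative only).
rng = np.random.default_rng(0)
n, c = 60, 3
labels = rng.integers(0, c, size=n)
one_hot = np.eye(c)[labels]
img_feats = one_hot @ rng.normal(size=(c, 8)) + 0.1 * rng.normal(size=(n, 8))
txt_feats = one_hot @ rng.normal(size=(c, 5)) + 0.1 * rng.normal(size=(n, 5))

# Stage 1: posterior probabilities as per-modality semantic representations.
P_img = semantic_posteriors(img_feats, labels, c)
P_txt = semantic_posteriors(txt_feats, labels, c)

# Stage 2: a shared representation for each image-text document.
S = shared_document_representation(P_img, P_txt, k=c)

# Stage 3: two projection matrices mapping each modality toward S.
A_img = fit_ridge(P_img, S)
A_txt = fit_ridge(P_txt, S)
img_emb = P_img @ A_img
txt_emb = P_txt @ A_txt
```

Under these assumptions, cross-media retrieval would amount to ranking `txt_emb` rows by their distance to a query row of `img_emb` (or the reverse). The sketch makes no claim about reproducing the reported results.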

Acknowledgements

This research is supported by the National Natural Science Foundation of China (No. 61373109, No. 61602349), the Hubei Chengguang Talented Youth Development Foundation (No. 2015B22), the Natural Science Foundation of Hubei Province (No. ZRMS2016000155), and the Science and Technology Research Project of the Hubei Provincial Department of Education (No. Q20161113).

Author information

Corresponding author

Correspondence to Hong Zhang.

Cite this article

Zhang, H., Huang, Y., Xu, X. et al. Latent semantic factorization for multimedia representation learning. Multimed Tools Appl 77, 3353–3368 (2018). https://doi.org/10.1007/s11042-017-5135-6

  • DOI: https://doi.org/10.1007/s11042-017-5135-6
