
Multimedia integrated annotation based on common space learning

Multimedia Tools and Applications

Abstract

Automatic multimedia annotation, which assigns text labels to multimedia objects, has been widely studied. However, existing methods usually model only two types of media data or only pairwise correlations. In fact, heterogeneous media are complementary, and optimizing them simultaneously can further improve accuracy. In this paper, we present a novel common space learning (CSL) algorithm for multimedia integrated annotation, in which heterogeneous media data are projected into a unified space and annotation is reduced to a nearest neighbor search in that space. Optimizing the heterogeneous media simultaneously makes them complementary to each other and aligns them in the common space. We formulate CSL as an optimization problem that addresses two main requirements. First, media objects of different types with similar labels should be close to each other in the common space. Second, similarity between media objects in the original space should be consistent with their similarity in the common space. We solve the optimization problem in a sparse, semi-supervised learning framework, so that additional unlabeled data can be integrated into the learning process, which can boost the performance of space learning. In addition, we propose an iterative optimization algorithm to solve the problem. Since the projected samples in the common space share the same representation, the labels for a new media object are assigned by a simple nearest neighbor voting mechanism. To the best of our knowledge, our method is the first attempt at multimedia integrated annotation. Experiments on data sets with up to four media types (image, sound, video and 3D model) show the effectiveness of the proposed approach compared with state-of-the-art methods.
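The annotation step described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that one projection matrix per media type has already been learned, and it substitutes small hand-picked matrices for them. The names `project`, `knn_vote`, `P_image`, and `P_sound`, the Euclidean distance, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def project(features, projection):
    """Map raw features (n, d_media) into the common space (n, d_common)."""
    return np.asarray(features, dtype=float) @ projection


def knn_vote(query, gallery, gallery_labels, k=3):
    """Rank labels by how often they occur among the k nearest gallery points."""
    dists = np.linalg.norm(gallery - query, axis=1)  # Euclidean distance (assumed)
    nearest = np.argsort(dists)[:k]
    votes = Counter(label for i in nearest for label in gallery_labels[i])
    return [label for label, _ in votes.most_common()]


# Hypothetical projections standing in for the learned ones:
# 3-D image features and 2-D sound features both map to a 2-D common space.
P_image = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
P_sound = np.array([[1.0, 0.0], [0.0, 1.0]])

# Labeled objects of two media types (toy data).
image_feats = [[0.0, 0.0, 5.0], [0.1, 0.0, 1.0]]
image_labels = [{"dog"}, {"dog", "animal"}]
sound_feats = [[5.0, 5.0], [0.0, 0.1]]
sound_labels = [{"car"}, {"dog"}]

# After projection, all media share one representation and can be pooled.
gallery = np.vstack([project(image_feats, P_image), project(sound_feats, P_sound)])
gallery_labels = image_labels + sound_labels

# Annotate a new sound clip by voting among its neighbors of any media type.
query = project([[0.05, 0.05]], P_sound)[0]
predicted = knn_vote(query, gallery, gallery_labels, k=3)
print(predicted)
```

Because every object lives in the same space after projection, the neighbors of the sound query include image objects, which is what lets one media type contribute labels to another.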



Acknowledgements

Special thanks go to the collaborators in the Lab for Media Search at the National University of Singapore for their instructive advice and useful suggestions on this work. This work is supported by the Natural Science Foundation of China (Nos. 61502094, 61402099, 61402016), the Natural Science Foundation of Heilongjiang Province of China (Nos. F2016002, F2015020), and the Beijing Natural Science Foundation (No. 4154067).

Author information

Corresponding author

Correspondence to Feng Tian.


About this article


Cite this article

Tian, F., Liu, X., Liu, Z. et al. Multimedia integrated annotation based on common space learning. Multimed Tools Appl 78, 437–456 (2019). https://doi.org/10.1007/s11042-017-5068-0

  • DOI: https://doi.org/10.1007/s11042-017-5068-0
