Multimedia integrated annotation based on common space learning

Tian, Feng; Liu, Xianmei; Liu, Zhuoxuan; Sun, Ning; Wang, Mei; Wang, Haochang; Zhang, Fengquan

doi:10.1007/s11042-017-5068-0


Published: 09 August 2017

Volume 78, pages 437–456, (2019)

Multimedia Tools and Applications

Feng Tian¹,² (ORCID: orcid.org/0000-0002-5916-5809),
Xianmei Liu¹,
Zhuoxuan Liu¹,
Ning Sun¹,
Mei Wang¹,
Haochang Wang¹ &
…
Fengquan Zhang³

314 Accesses
4 Citations

Abstract

Multimedia automatic annotation, which assigns text labels to multimedia objects, has been widely studied. However, existing methods usually model only two types of media data or pairwise correlations. In fact, heterogeneous media are complementary to each other, and optimizing them simultaneously can further improve accuracy. In this paper, we present a novel common space learning (CSL) algorithm for multimedia integrated annotation, which projects heterogeneous media data into a unified space and turns multimedia annotation into a nearest neighbor search in that space. Optimizing the heterogeneous media simultaneously makes them complementary to each other and aligns them in the common space. We formulate CSL as an optimization problem that mainly considers two issues. First, media objects of different types with similar labels should be close in the common space. Second, media similarity in the original space and in the common space should be consistent. We solve the optimization problem in a sparse, semi-supervised learning framework, so that more unlabeled data can be integrated into the learning process, which can boost the performance of space learning. In addition, we propose an iterative optimization algorithm to solve the problem. Since the projected samples in the common space share the same representation, the labels for a new media object are assigned by a simple nearest neighbor voting mechanism. To the best of our knowledge, this is the first attempt at multimedia integrated annotation. Experiments on data sets with up to four media types (image, sound, video and 3D model) show the effectiveness of the proposed approach compared with state-of-the-art methods.
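
The annotation step described in the abstract (project each media object into the common space, then vote among its nearest neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear projection matrices P_image and P_audio, the Euclidean distance, the toy feature values and the helper names project and knn_vote are all assumptions made for illustration. In the paper the projections would be learned by the CSL optimization; here they are fixed by hand.

```python
import numpy as np
from collections import Counter

def project(features, projection):
    """Map features of one media type into the shared space (linear map assumed)."""
    return features @ projection

def knn_vote(query, points, labels, k=3):
    """Return the majority label among the k nearest points in the common space."""
    distances = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical, hand-picked projections into a 2-D common space.
# Image features are 4-D and audio features are 3-D in this toy setup.
P_image = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
P_audio = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])

# Labeled training objects of two different media types.
train_images = np.array([[0.0, 0.0, 5.0, 1.0], [0.1, 0.2, 3.0, 3.0]])
image_labels = ["dog", "dog"]
train_audio = np.array([[5.0, 5.0, 0.0], [5.2, 4.9, 1.0], [0.2, 0.1, 9.0]])
audio_labels = ["car", "car", "dog"]

# After projection, all media types share one representation and one label pool.
common_points = np.vstack([project(train_images, P_image),
                           project(train_audio, P_audio)])
common_labels = image_labels + audio_labels

# Annotate a new audio clip and a new image with the same voting routine.
new_audio = project(np.array([0.1, 0.1, 7.0]), P_audio)
new_image = project(np.array([5.1, 5.0, 0.0, 0.0]), P_image)
audio_tag = knn_vote(new_audio, common_points, common_labels)
image_tag = knn_vote(new_image, common_points, common_labels)
```

Because every media type lands in the same space, neighbors of a query may come from any media type: in this toy data the new audio clip is labeled partly by nearby images.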




Acknowledgements

Special thanks go to the collaborators in the Lab for Media Search at the National University of Singapore for their instructive advice and useful suggestions on this work. This work is supported by the Natural Science Foundation of China (Nos. 61502094, 61402099, 61402016), the Natural Science Foundation of Heilongjiang Province of China (Nos. F2016002, F2015020) and the Beijing Natural Science Foundation (No. 4154067).

Author information

Authors and Affiliations

School of Computer and Information Technology, Northeast Petroleum University, DaQing, 163318, China
Feng Tian, Xianmei Liu, Zhuoxuan Liu, Ning Sun, Mei Wang & Haochang Wang
School of Computing, National University of Singapore, Singapore, 119077, Singapore
Feng Tian
School of Computer Science, North China University of Technology, Beijing, 100144, China
Fengquan Zhang

Corresponding author

Correspondence to Feng Tian.

About this article

Cite this article

Tian, F., Liu, X., Liu, Z. et al. Multimedia integrated annotation based on common space learning. Multimed Tools Appl 78, 437–456 (2019). https://doi.org/10.1007/s11042-017-5068-0

Received: 26 April 2017
Revised: 11 June 2017
Accepted: 27 July 2017
Published: 09 August 2017
Issue Date: January 2019
DOI: https://doi.org/10.1007/s11042-017-5068-0
