Abstract
This paper presents a novel method for analyzing multimedia document content by modeling correlations among multimodal data. We hypothesize that exploiting the correlations between different modalities of the same data source yields better multimedia content understanding than exploring a single modality alone. We divide the task into two stages: multimedia data fusion and multimodal correlation propagation. In the first stage, we extract quantized multimodal features, reorganize the training multimedia data into Modality semAntic Documents (MADs), and use multivariate Gaussian distributions to characterize the continuous quantities in latent topic modeling. Model parameters are learned asymmetrically to initialize multimodal correlations in the latent topic space. In the second stage, we construct a Multimodal Correlation Network (MCN) from the initialized multimodal correlations and propose a new mechanism that propagates inter-modality correlations and intra-modality similarities through the MCN, exploiting complementary information across modalities to facilitate multimedia content analysis. Experimental results on image-audio data retrieval with a 10-category dataset and on content-oriented web page recommendation with the USTODAY dataset show the effectiveness of our method.
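To make the second stage more concrete, the following is a minimal sketch of propagating inter-modality correlations along intra-modality similarities. It is not the MCN mechanism of this paper, whose update rule is not given in the abstract. It assumes a generic damped propagation rule, and the names `propagate`, `alpha`, and `iters`, as well as the toy matrices, are hypothetical and chosen only for illustration.

```python
# Generic similarity-propagation sketch (an assumed update rule, not the
# paper's MCN mechanism). Plain Python lists are used so it runs as-is.

def row_normalize(m):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total > 0 else 0.0 for v in row])
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def propagate(c0, s_a, s_b, alpha=0.5, iters=20):
    """Iterate C <- alpha * S_a C S_b^T + (1 - alpha) * C0.

    c0  : initial inter-modality correlations (n_a x n_b)
    s_a : intra-modality similarities of modality A (n_a x n_a)
    s_b : intra-modality similarities of modality B (n_b x n_b)
    """
    s_a = row_normalize(s_a)
    s_b_t = transpose(row_normalize(s_b))
    c = [row[:] for row in c0]
    for _ in range(iters):
        spread = matmul(matmul(s_a, c), s_b_t)
        c = [[alpha * spread[i][j] + (1 - alpha) * c0[i][j]
              for j in range(len(c0[0]))]
             for i in range(len(c0))]
    return c

# Toy data (hypothetical): images 0 and 1 are similar, image 2 is unrelated.
s_img = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.0, 1.0]]
s_aud = [[1.0, 0.0], [0.0, 1.0]]
# Only image 0 <-> audio 0 and image 2 <-> audio 1 are known initially.
c0 = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

result = propagate(c0, s_img, s_aud)
```

In this toy setting, image 1 starts with no known audio correlation and acquires a positive correlation with audio 0 only because it is similar to image 0. This is the kind of complementary cross-modal evidence the second stage is meant to exploit.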
Acknowledgment
The work described in this paper was supported by the Natural Science Foundation of China under Grant No. 61272218 and No. 61321491, the 973 Program of China under Grant No. 2010CB327903, and the Program for New Century Excellent Talents under NCET-11-0232. The authors thank the anonymous reviewers for their constructive comments, which helped to improve the paper.
Cite this article
Lu, T., Jin, Y., Su, F. et al. Content-oriented multimedia document understanding through cross-media correlation. Multimed Tools Appl 74, 8105–8135 (2015). https://doi.org/10.1007/s11042-014-2044-9
DOI: https://doi.org/10.1007/s11042-014-2044-9