Content-oriented multimedia document understanding through cross-media correlation

Multimedia Tools and Applications

Abstract

This paper presents a novel method for analyzing multimedia document content by modeling correlations among multimodal data. We hypothesize that exploiting the correlations between different modalities from the same data source yields better multimedia content understanding than analyzing a single modality alone. We divide the task into two stages: multimedia data fusion and multimodal correlation propagation. In the first stage, we extract quantized multimodal features, reorganize the training multimedia data into Modality semAntic Documents (MADs), and then use multivariate Gaussian distributions to characterize the continuous quantities within latent topic modeling. Model parameters are learned asymmetrically to initialize multimodal correlations in the latent topic space. In the second stage, we construct a Multimodal Correlation Network (MCN) from the initialized multimodal correlations and propose a new mechanism for propagating inter-modality correlations and intra-modality similarities in the MCN, which exploits complementary information across modalities to facilitate multimedia content analysis. Experimental results on image-audio data retrieval with a 10-category dataset and on content-oriented web page recommendation with the USTODAY dataset show the effectiveness of our method.
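
To make the idea of propagating inter-modality correlations along intra-modality similarities more concrete, the following is a minimal sketch of a generic similarity-propagation scheme. It is not the update rule used in the paper, which the abstract does not specify. The function names (`row_normalize`, `propagate`), the damping weight `alpha`, the iteration count, and the toy matrices are all assumptions made for illustration only.

```python
from typing import List

Matrix = List[List[float]]


def row_normalize(m: Matrix) -> Matrix:
    """Scale each row to sum to 1; rows that sum to 0 stay all zeros."""
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def propagate(sim_a: Matrix, sim_b: Matrix, corr0: Matrix,
              alpha: float = 0.5, iters: int = 20) -> Matrix:
    """Hypothetical propagation: spread cross-modal correlations to
    neighbours within each modality, then blend with the initial values.

    sim_a : intra-modality similarities for modality A (n x n)
    sim_b : intra-modality similarities for modality B (m x m)
    corr0 : initial inter-modality correlations (n x m)
    """
    wa = row_normalize(sim_a)
    wb_t = transpose(row_normalize(sim_b))
    corr = [row[:] for row in corr0]
    for _ in range(iters):
        spread = matmul(matmul(wa, corr), wb_t)
        corr = [
            [alpha * s + (1.0 - alpha) * c0 for s, c0 in zip(s_row, c0_row)]
            for s_row, c0_row in zip(spread, corr0)
        ]
    return corr


# Toy data (assumed): three images and two audio clips.
# Images 0 and 1 look alike; image 1 has no known audio correlation.
sim_img = [[1.0, 0.9, 0.1],
           [0.9, 1.0, 0.1],
           [0.1, 0.1, 1.0]]
sim_aud = [[1.0, 0.2],
           [0.2, 1.0]]
corr_init = [[1.0, 0.0],
             [0.0, 0.0],
             [0.0, 1.0]]

scores = propagate(sim_img, sim_aud, corr_init)
```

In this toy setting, image 1 starts with no cross-modal correlation but ends up more strongly associated with audio clip 0 than with clip 1, because it is visually similar to image 0. This illustrates how one modality can supply information that is missing in the other.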

Acknowledgment

The work described in this paper was supported by the Natural Science Foundation of China under Grant No. 61272218 and No. 61321491, the 973 Program of China under Grant No. 2010CB327903, and the Program for New Century Excellent Talents under NCET-11-0232. The authors thank the anonymous reviewers for their constructive comments, which helped to improve the paper.

Author information

Corresponding author

Correspondence to Tong Lu.

About this article

Cite this article

Lu, T., Jin, Y., Su, F. et al. Content-oriented multimedia document understanding through cross-media correlation. Multimed Tools Appl 74, 8105–8135 (2015). https://doi.org/10.1007/s11042-014-2044-9

  • DOI: https://doi.org/10.1007/s11042-014-2044-9
