
A multimodal alignment framework for spoken documents

Multimedia Tools and Applications

Abstract

We present a multimodal document alignment framework that highlights the alignment relationships between documents that are discussed and recorded during multimedia events such as meetings. These relationships, which should help in indexing the archives of such events, are detected using techniques from natural language processing and information retrieval. The main alignment strategies studied are based on thematic, quotation and reference relationships. The framework was applied at several levels of document granularity, which required specific document segmentation techniques. It is language independent and was evaluated on corpora in French and English, including meetings and scientific presentations. The satisfactory evaluation results obtained at several stages show the importance of our approach in bridging the gap between meeting documents, independently of language and domain. They also highlight the utility of multimodal alignment in advanced applications such as multimedia document browsing and content-based or temporal-based searching.
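As a rough illustration of what a thematic alignment between speech and documents could look like, the sketch below links each transcript segment to the document segment with the highest term-frequency cosine similarity. This is not the method described in the article: the similarity measure, the `align` function, the `threshold` value and the sample segments are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(speech_segments, document_segments, threshold=0.2):
    """Link each speech segment to its most similar document segment.

    Returns (speech_index, document_index, score) triples whose score
    reaches the threshold. The threshold is a hypothetical value chosen
    for illustration, not one taken from the article.
    """
    doc_vectors = [Counter(tokenize(d)) for d in document_segments]
    links = []
    if not doc_vectors:
        return links
    for i, speech in enumerate(speech_segments):
        vec = Counter(tokenize(speech))
        scores = [cosine(vec, d) for d in doc_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            links.append((i, best, scores[best]))
    return links


# Invented sample data: two transcript segments and two document segments.
speech = [
    "let us look at the budget figures for next year",
    "the new logo design uses a blue background",
]
documents = [
    "Logo design proposal: blue background, white text",
    "Budget figures and forecast for next year",
]
links = align(speech, documents)
```

A real system would at least need stop-word removal, stemming or lemmatisation per language, and a segmentation step that produces the segments in the first place. The sketch only shows the linking step.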



Author information


Corresponding author

Correspondence to Dalila Mekhaldi.


About this article

Cite this article

Mekhaldi, D., Lalanne, D. & Ingold, R. A multimodal alignment framework for spoken documents. Multimed Tools Appl 61, 353–388 (2012). https://doi.org/10.1007/s11042-011-0842-x


  • DOI: https://doi.org/10.1007/s11042-011-0842-x
