Abstract
Though everyday interaction is predominantly multimodal, a purpose-built framework for describing the semantic interplay between verbal and non-verbal communication is still lacking. This gap not only reflects our poor understanding of multimodal human behaviour, but also weakens any attempt to model such behaviour computationally. In this article, we present COSMOROE, a corpus-based framework for describing semantic interrelations between images, language and body movements. We argue that by viewing such relations from a message-formation perspective rather than a communicative-goal one, one may develop a framework with both descriptive power and computational applicability. We test COSMOROE for compliance with these criteria by using it to annotate a corpus of TV travel programmes; we present all particulars of the annotation process and conclude with a discussion of the usability and scope of such annotated corpora.
Additional information
Communicated by B. Bailey.
Cite this article
Pastra, K. COSMOROE: a cross-media relations framework for modelling multimedia dialectics. Multimedia Systems 14, 299–323 (2008). https://doi.org/10.1007/s00530-008-0142-0
DOI: https://doi.org/10.1007/s00530-008-0142-0