InSight Interaction: a multimodal and multifocal dialogue corpus

  • Project Notes
  • Published in Language Resources and Evaluation

Abstract

Research on the multimodal aspects of interactional language use requires high-quality multimodal resources. In contrast to the vast amount of available written language corpora and collections of transcribed spoken language, truly multimodal corpora including visual as well as auditory data are scarce. In this paper, we first discuss a few notable exceptions that do provide high-quality and multiple-angle video recordings of face-to-face conversations. We then present a new multimodal corpus design that adds two dimensions to the existing resources. First, the recording set-up was designed in such a way as to have a full view of the dialogue partners’ gestural behaviour, including hand gestures, facial expressions and body posture. Second, by recording the participant perspective and behaviour during conversation, using head-mounted scene cameras and eye-trackers, we obtained a 3D landscape of the conversation, with detailed production information (scene camera and sound) and indices of cognitive processing (eye movements for gaze analysis) for both participants. In its current form, the resulting InSight Interaction Corpus consists of 15 recorded face-to-face interactions of 20 min each, of which five have been transcribed and annotated for a range of linguistic and gestural features, using the ELAN multimodal annotation tool.
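
The abstract mentions that the annotated part of the corpus was produced with the ELAN multimodal annotation tool. As a minimal sketch (not part of the corpus distribution), the following Python snippet shows how such ELAN annotation files could be inspected programmatically with the third-party pympi-elan library; the file name and tier name are hypothetical placeholders, not names from the InSight Interaction Corpus.

```python
# Minimal sketch: reading ELAN (.eaf) annotations with the third-party
# pympi-elan library (PyPI package "pympi-ling"). The file and tier names
# below are hypothetical placeholders.
import pympi

eaf = pympi.Elan.Eaf("dialogue_01.eaf")

# List all annotation tiers defined in the file.
print(eaf.get_tier_names())

# Annotations are returned as (start_ms, end_ms, value) tuples.
for start, end, value in eaf.get_annotation_data_for_tier("HandGesture_A"):
    print(f"{start / 1000:8.2f}-{end / 1000:8.2f} s  {value}")
```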

Notes

  1. Unless explicitly stated otherwise, we use gesture in this paper to refer to hand gestures; i.e. we do not use gesture as a cover term for any type of body movement, such as head movements, posture or eye movements.

  2. The mobile eye-trackers provide two types of data: video files from a scene camera, and data files with the eye movements (simple x and y co-ordinates that together give the exact location of the fixation point, sampled at 30 Hz). The right-hand images in Fig. 4 are an overlay of the scene-camera video with the gaze co-ordinates from the data files; a minimal overlay sketch follows these notes.

  3. The average length of the video files was 6.36 min. The average number of dropped frames per video file was 38, with nearly all of the dropped frames occurring in clusters of 3–7 frames.

  4. On average there was an anchor point in the eye-tracker video files every 21.38 s. The exact number and position of the anchor points depended on the content of the video data: the onsets and offsets of hand gestures were used particularly often as anchor points, because these actions provide clear signals in each of the video files. A small sketch of how such anchor points can be used for alignment also follows these notes.

  5. This holds especially in comparison with studies that rely on video recordings alone to determine eye gaze, such as Kendon (2004), Paggio et al. (2010), and Streeck (2009).
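
As a concrete illustration of the overlay procedure described in note 2, the sketch below draws gaze co-ordinates on the corresponding scene-camera frames with OpenCV. It assumes one (x, y) gaze sample per video frame at 30 fps, expressed in frame pixel co-ordinates; the file names and CSV layout are assumptions for illustration only, not the format distributed with the corpus.

```python
# Minimal sketch (not from the paper): overlaying gaze co-ordinates on the
# scene-camera video with OpenCV. Assumes one (x, y) gaze sample per frame at
# 30 fps in frame pixel co-ordinates; file names and CSV layout are hypothetical.
import csv
import cv2

# One row per frame with the pixel co-ordinates of the fixation point.
with open("participant_A_gaze.csv", newline="") as f:
    gaze = [(float(r["x"]), float(r["y"])) for r in csv.DictReader(f)]

cap = cv2.VideoCapture("participant_A_scene.avi")
out = None
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok or frame_idx >= len(gaze):
        break
    if out is None:
        h, w = frame.shape[:2]
        out = cv2.VideoWriter("participant_A_overlay.avi",
                              cv2.VideoWriter_fourcc(*"XVID"), 30.0, (w, h))
    x, y = gaze[frame_idx]
    cv2.circle(frame, (int(x), int(y)), 12, (0, 0, 255), 2)  # draw fixation point
    out.write(frame)
    frame_idx += 1

cap.release()
if out is not None:
    out.release()
```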
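
Note 4 describes synchronisation via anchor points, such as gesture onsets, that are visible in every recording. The sketch below shows one way such anchor points could be turned into a clock mapping between two files, using a least-squares linear fit with NumPy; the timestamps are invented for illustration and do not come from the corpus.

```python
# Minimal sketch (not from the paper): turning shared anchor points into a
# clock mapping between two recordings via a least-squares linear fit.
# All timestamps below are invented for illustration.
import numpy as np

# Times (s) at which the same anchor events (e.g. gesture onsets) occur
# in the two recordings that need to be aligned.
anchors_eyetracker = np.array([12.40, 33.95, 55.10, 76.62])
anchors_overview_cam = np.array([14.12, 35.66, 56.83, 78.31])

# Fit overview_time ~ a * eyetracker_time + b; a stays close to 1.0
# unless the two clocks drift relative to each other.
a, b = np.polyfit(anchors_eyetracker, anchors_overview_cam, 1)
print(f"drift factor a = {a:.5f}, offset b = {b:.3f} s")

def eyetracker_to_overview(t_seconds):
    """Map a timestamp from the eye-tracker clock to the overview-camera clock."""
    return a * t_seconds + b

print(eyetracker_to_overview(60.0))
```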

References

  • Adolphs, S., Knight, D., & Carter, R. (2011). Capturing context for heterogeneous corpus analysis: Some first steps. International Journal of Corpus Linguistics, 16, 305–324.

  • Allwood, J. (2008). Multimodal corpora. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (Vol. 29, pp. 207–225). Berlin: Mouton de Gruyter.

  • Allwood, J., Cerrato, L., Jokinen, K., Navarretta, C., & Paggio, P. (2007). The MUMIN coding scheme for the annotation of feedback, turn management, and sequencing phenomena. In J. Martin, P. Paggio, P. Kuenlein, R. Stiefelhagen, & F. Pianesi (Eds.), Multimodal corpora for modelling human multimodal behaviour (Vol. 41, pp. 273–287). Heidelberg: Springer.

  • Bavelas, J., Coates, L., & Johnson, T. (2002). Listener responses as a collaborative process: The role of gaze. Journal of Communication, 52, 566–580.

  • Bertrand, R., Blache, P., Espesser, R., Ferré, G., Meunier, C., Priego-Valverde, B., et al. (2008). Le CID—Corpus of interactional data—Annotation et exploitation multimodale de parole conversationnelle. Traitement automatique des langues, 49, 105–134.

  • Blache, P., Bertrand, R., & Ferré, G. (2008). Creating and exploiting multimodal annotated corpora. In Proceedings of the sixth international conference on language resources and evaluation (LREC).

  • Boersma, P., & Weenink, D. (2009). PRAAT: Doing phonetics by computer (version 5.3.05). http://www.praat.org/. Accessed February 27, 2012.

  • Brennan, S., Chen, X., Dickinson, C., Neider, M., & Zelinsky, G. (2008). Coordinating cognition: The costs and benefits of shared gaze during collaborative search. Cognition, 106, 1465–1477.

  • Brugman, H., & Russel, A. (2004). Annotating multimedia/multi-modal resources with ELAN. In Proceedings of the fourth international conference on language resources and evaluation (LREC).

  • Campbell, N. (2009). Tools and resources for visualising conversational speech interaction. In M. Kipp, J. Martin, P. Paggio, & D. Heylen (Eds.), Multimodal corpora: From models of natural interaction to systems and applications (pp. 231–234). Heidelberg: Springer.

  • Cavicchio, F., & Poesio, M. (2009). Multimodal corpora annotation: Validation methods to assess coding scheme reliability. In M. Kipp, J. Martin, P. Paggio, & D. Heylen (Eds.), Multimodal corpora: From models of natural interaction to systems and applications (pp. 109–121). Heidelberg: Springer.

  • Chen, L., Travis-Rose, R., Parrill, F., Han, X., Tu, J., Huang, Z., et al. (2006). VACE multimodal meeting corpus. Lecture Notes in Computer Science, 3869, 40–51.

  • Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson, S., & House, D. (2010). Spontal: A Swedish spontaneous dialogue corpus of audio, video and motion capture. In Proceedings of the seventh international conference on language resources and evaluation (LREC).

  • Fanelli, G., Gall, J., Romsdorfer, H., Weise, T., & Van Gool, L. (2010). 3D vision technology for capturing multimodal corpora: Chances and challenges. In Proceedings of the seventh international conference on language resources and evaluation (LREC).

  • Feyaerts, K., Oben, B., Brône, G., & Speelman, D. (2011). Corpus interactional humour. http://www.arts.kuleuven.be/ling/midi/corpora-tools.

  • Gerwing, J., & Allison, M. (2009). The relationship between verbal and gestural contributions in conversation: A comparison of three methods. Gesture, 9, 312–336.

  • Hadelich, K., & Crocker, M. (2006). Gaze alignment of interlocutors in conversational dialogues. In Proceedings of the 2006 symposium on eye tracking research and applications.

  • Hanna, J., & Brennan, S. (2007). Speakers’ eye gaze disambiguates referring expressions early during face-to-face conversation. Journal of Memory and Language, 57, 596–615.

  • Herrera, D., Novick, D., Jan, D., & Traum, D. (2010). The UTEP-ICT cross-cultural multiparty multimodal dialog corpus. In Proceedings of the seventh international conference on language resources and evaluation (LREC).

  • Jacob, R., & Karn, K. (2003). Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In R. Radach, J. Hyönä, & H. Deubel (Eds.), The mind’s eye: Cognitive and applied aspects of eye movement research (pp. 573–605). Oxford: Elsevier Science.

  • Jokinen, K. (2010). Non-verbal signals for turn-taking and feedback. In Proceedings of the seventh international conference on language resources and evaluation (LREC).

  • Jokinen, K., Nishida, M., & Yamamoto, S. (2009). Eye gaze experiments for conversation monitoring. In Proceedings of the 3rd international universal communication symposium.

  • Kendon, A. (2004). Gesture: Visible action as utterance. Cambridge: Cambridge University Press.

  • Kimbara, I. (2006). On gestural mimicry. Gesture, 6, 39–61.

  • Kipp, M., Neff, M., & Albrecht, I. (2007). An annotation scheme for conversational gestures: How to economically capture timing and form. Journal on Language Resources and Evaluation, 41, 325–339.

  • Knight, D. (2011). The future of multimodal corpora. Revista Brasileira de Linguistica Aplicada, 11, 391–415.

  • Knight, D., Adolphs, S., Tennent, P., & Carter, R. (2008). The Nottingham multi-modal corpus: A demonstration. In Proceedings of the sixth international conference on language resources and evaluation (LREC).

  • Knight, D., Evans, D., Carter, R., & Adolphs, S. (2009). HeadTalk, HandTalk and the corpus: Towards a framework for multi-modal, multi-media corpus development. Corpora, 4, 1–32.

  • Lausberg, H., & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, & Computers, 41, 841–849.

  • Massaro, D., & Beskow, J. (2002). Multimodal speech perception: A paradigm for speech science. In B. Granstrom, D. House, & I. Karlsson (Eds.), Multimodality in language and speech systems (pp. 45–71). Dordrecht: Kluwer Academic.

  • McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: University of Chicago Press.

  • McNeill, D. (2005). Gesture and thought. Chicago: University of Chicago Press.

  • McNeill, D. (2008). Unexpected metaphors. In A. Cienki & C. Müller (Eds.), Metaphor and gesture (pp. 155–170). Amsterdam: John Benjamins.

  • Oostdijk, N. (2000). The spoken Dutch corpus: Overview and first evaluation. In Proceedings of the second international conference on language resources and evaluation (LREC).

  • Paggio, P., Allwood, J., Ahlsén, E., Jokinen, K., & Navarretta, C. (2010). The NOMCO multimodal Nordic resource—Goals and characteristics. In Proceedings of the seventh international conference on language resources and evaluation (LREC).

  • Pickering, M., & Garrod, S. (2004). Toward a mechanistic psychology of dialogue. Behavioral and Brain Sciences, 27, 169–226.

  • Pickering, M., & Garrod, S. (2006). Alignment as the basis for successful communication. Research on Language and Computation, 4, 203–228.

  • Pine, K., Lufkin, N., & Messer, D. (2004). More gestures than answers: Children learning about balance. Developmental Psychology, 40, 1059–1067.

  • Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372–422.

  • Selting, M. (2000). The construction of units in conversational talk. Language in Society, 29, 477–517.

  • Selting, M., Auer, P., Barden, B., Couper-Kuhlen, E., Günthner, S., Quasthoff, U., et al. (1998). Gesprächsanalytisches Transkriptionssystem (GAT). Linguistische Berichte, 173, 91–122.

  • Staudte, M., Heloir, A., Crocker, M., & Kipp, M. (2011). On the importance of gaze and speech alignment for efficient communication. In Proceedings of the 9th international gesture workshop.

  • Streeck, J. (2009). Gesturecraft—The manufacture of meaning. Amsterdam/Philadelphia: John Benjamins.

  • Tanenhaus, M., & Brown-Schmidt, S. (2008). Language processing in the natural world. In B. Moore, L. Tyler, & W. Marslen-Wilson (Eds.), The perception of speech: From sound to meaning. Philosophical Transactions of the Royal Society B: Biological Sciences, 363, 1105–1122.

  • Van den Bosch, A., et al. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In Selected papers of the 17th computational linguistics in the Netherlands meeting.

  • Van Son, R., Wesseling, W., Sanders, E., & Van den Heuvel, H. (2008). The IFADV corpus: A free dialog video corpus. In Proceedings of the sixth international conference on language resources and evaluation (LREC).

  • Vertegaal, R., Slagter, R., Van der Veer, G., & Nijholt, A. (2001). Eye gaze patterns in conversations: There is more to conversational agents than meets the eyes. In Proceedings of the Conference on Human Factors in Computing Systems.

Acknowledgments

This work was partially supported by Grant Number 3H090339 STIM/09/03 of the University of Leuven.

Author information

Corresponding author

Correspondence to Geert Brône.

Appendix

See Table 2.

Table 2 Annotation parameters for gesture and eye gaze

Cite this article

Brône, G., Oben, B. InSight Interaction: a multimodal and multifocal dialogue corpus. Lang Resources & Evaluation 49, 195–214 (2015). https://doi.org/10.1007/s10579-014-9283-2
