Evaluating Multimodal Behavior Schemas with VoxWorld

  • Conference paper
Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management (HCII 2023)

Abstract

The ability to understand and model human-object interactions is becoming increasingly important in advancing the field of human-computer interaction (HCI). To maintain more effective dialogue, embodied agents must utilize situated reasoning, that is, the ability to ground objects in a shared context and understand their roles in the conversation [35]. In this paper, we argue that developing a unified multimodal annotation schema for human actions, in addition to gesture and speech, is a crucial next step towards this goal. We develop a new approach for visualizing annotation schemas such as Gesture AMR [5] and VoxML [33] by simulating their output with VoxWorld [21] in the context of a collaborative problem-solving task. We discuss the implications of this method and propose a novel testing paradigm that uses the generated simulation to validate these annotations for accuracy and completeness.

This work was supported in part by NSF grant DRL 2019805, to Dr. Pustejovsky at Brandeis University. We would like to express our thanks to Nikhil Krishnaswamy for his comments on the multimodal framework motivating the development of the simulation platform, FibWorld. The views expressed herein are ours alone.

Notes

  1. We are currently developing a much richer specification for action annotation (Action AMR), for collaborative tasks as well as procedural texts and narratives.

References

  1. Banarescu, L., et al.: Abstract meaning representation (AMR) 1.0 specification. In: Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle: ACL, pp. 1533–1544 (2012)

  2. Bonial, C., et al.: Dialogue-AMR: abstract meaning representation for dialogue. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 684–695 (2020)

  3. Bradford, M., et al.: Challenges and opportunities in annotating a multimodal collaborative problem-solving task (2022)

  4. Brugman, H., Russel, A.: Annotating multi-media/multi-modal resources with ELAN. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). European Language Resources Association (ELRA), Lisbon, Portugal (2004). http://www.lrec-conf.org/proceedings/lrec2004/pdf/480.pdf

  5. Brutti, R., Donatelli, L., Lai, K., Pustejovsky, J.: Abstract meaning representation for gesture. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, pp. 1576–1583. European Language Resources Association (2022). https://aclanthology.org/2022.lrec-1.169

  6. Cassell, J., Sullivan, J., Churchill, E., Prevost, S.: Embodied Conversational Agents. MIT Press, Cambridge (2000)

  7. Copestake, A., Flickinger, D., Pollard, C., Sag, I.A.: Minimal recursion semantics: an introduction. Res. Lang. Comput. 3(2–3), 281–332 (2005)

  8. Evans, L., Rzeszewski, M.: Hermeneutic relations in VR: immersion, embodiment, presence and HCI in VR gaming. In: Fang, X. (ed.) HCII 2020. LNCS, vol. 12211, pp. 23–38. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50164-8_2

  9. Foster, M.E.: Enhancing human-computer interaction with embodied conversational agents. In: Stephanidis, C. (ed.) UAHCI 2007. LNCS, vol. 4555, pp. 828–837. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-73281-5_91

  10. Gu, C., et al.: Ava: a video dataset of spatio-temporally localized atomic visual actions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056 (2018)

  11. Gupta, S., Malik, J.: Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015)

  12. Hassanin, M., Khan, S., Tahtali, M.: Visual affordance and function understanding: a survey. ACM Comput. Surv. (CSUR) 54(3), 1–35 (2021)

  13. Helfrich, P., Rieb, E., Abrami, G., Lücking, A., Mehler, A.: Treeannotator: versatile visual annotation of hierarchical text relations. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)

  14. Henlein, A., Gopinath, A., Krishnaswamy, N., Mehler, A., Pustejovsky, J.: Grounding human-object interaction to affordance behavior in multimodal datasets. Front. Artif. Intell. 6, 1084740 (2023)

  15. Kamp, H., Van Genabith, J., Reyle, U.: Discourse representation theory. In: Gabbay, D., Guenthner, F. (eds.) Handbook of Philosophical Logic, pp. 125–394. Springer, Dordrecht (2011). https://doi.org/10.1007/978-94-007-0485-5_3

  16. Karau, S.J., Williams, K.D.: Social loafing: a meta-analytic review and theoretical integration. J. Pers. Soc. Psychol. 65(4), 681 (1993)

  17. Kipp, M., Neff, M., Albrecht, I.: An annotation scheme for conversational gestures: how to economically capture timing and form. Lang. Resour. Eval. 41, 325–339 (2007)

  18. Knight, K., et al.: Abstract meaning representation (AMR) annotation release 1.2.6. Web download (2019)

  19. Kopp, S., et al.: Towards a common framework for multimodal generation: the behavior markup language. In: Gratch, J., Young, M., Aylett, R., Ballin, D., Olivier, P. (eds.) IVA 2006. LNCS (LNAI), vol. 4133, pp. 205–217. Springer, Heidelberg (2006). https://doi.org/10.1007/11821830_17

  20. Kopp, S., Wachsmuth, I.: Gesture in Embodied Communication and Human-Computer Interaction, vol. 5934. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-12553-9

  21. Krishnaswamy, N., et al.: Situational awareness in human computer interaction: Diana’s world (2020)

  22. Krishnaswamy, N., et al.: Diana’s world: a situated multimodal interactive agent. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13618–13619 (2020)

  23. Krishnaswamy, N., Pickard, W., Cates, B., Blanchard, N., Pustejovsky, J.: The voxworld platform for multimodal embodied agents. In: Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 1529–1541 (2022)

  24. Krishnaswamy, N., Pustejovsky, J.: Voxsim: a visual platform for modeling motion language. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pp. 54–58 (2016)

  25. Lücking, A., Bergmann, K., Hahn, F., Kopp, S., Rieser, H.: The bielefeld speech and gesture alignment corpus (SaGA) (2010). https://doi.org/10.13140/2.1.4216.1922

  26. Marshall, P., Hornecker, E.: Theories of embodiment in HCI. In: The SAGE Handbook of Digital Technology Research, vol. 1, pp. 144–158 (2013)

  27. Martin, J.C., Niewiadomski, R., Devillers, L., Buisine, S., Pelachaud, C.: Multimodal complex emotions: gesture expressivity and blended facial expressions. Int. J. Humanoid Rob. 3(03), 269–291 (2006)

  28. Nakayama, H., Kubo, T., Kamura, J., Taniguchi, Y., Liang, X.: doccano: text annotation tool for human (2018). https://github.com/doccano/doccano

  29. Palmer, M., Gildea, D., Kingsbury, P.: The proposition bank: an annotated corpus of semantic roles. Comput. Linguist. 31(1), 71–106 (2003)

  30. Podlasov, A., Tan, S., O’Halloran, K.: Interactive state-transition diagrams for visualization of multimodal annotation. Intell. Data Anal. 16, 683–702 (2012). https://doi.org/10.3233/IDA-2012-0544

  31. Pustejovsky, J., Krishnaswamy, N.: Embodied human computer interaction. Künstliche Intelligenz (2021)

  32. Pustejovsky, J.: Unifying linguistic annotations: a timeml case study. In: Proceedings of Text, Speech, and Dialogue Conference (2006)

  33. Pustejovsky, J., Krishnaswamy, N.: Voxml: a visualization modeling language. In: Proceedings of LREC (2016)

  34. Pustejovsky, J., Krishnaswamy, N.: Voxml: a visualization modeling language. arXiv preprint arXiv:1610.01508 (2016)

  35. Pustejovsky, J., Krishnaswamy, N.: Multimodal semantics for affordances and actions. In: Kurosu, M. (ed.) HCII 2022. LNCS, vol. 13302, pp. 137–160. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-05311-5_9

  36. Pustejovsky, J., Stubbs, A.: Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications. O’Reilly Media, Inc. (2012)

  37. Reallusion Inc.: Character Creator 4 (2022). https://www.reallusion.com/character-creator/

  38. Sadhu, A., Gupta, T., Yatskar, M., Nevatia, R., Kembhavi, A.: Visual semantic role labeling for video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5589–5600 (2021)

  39. Schaffer, S., Reithinger, N.: Conversation is multimodal: thus conversational user interfaces should be as well. In: Proceedings of the 1st International Conference on Conversational User Interfaces, pp. 1–3 (2019)

  40. Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J.: Brat: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 102–107 (2012)

  41. Van Gysel, J.E., et al.: Designing a uniform meaning representation for natural language processing. KI-Künstliche Intelligenz, pp. 1–18 (2021)

  42. Wahlster, W.: Dialogue systems go multimodal: the smartkom experience. In: Wahlster, W. (ed.) SmartKom: Foundations of Multimodal Dialogue Systems, pp. 3–27. Springer, Heidelberg (2006). https://doi.org/10.1007/3-540-36678-4_1

  43. Wolfert, P., Robinson, N., Belpaeme, T.: A review of evaluation practices of gesture generation in embodied conversational agents. IEEE Trans. Hum.-Mach. Syst. (2022)

  44. Yang, S., Gao, Q., Liu, C., Xiong, C., Zhu, S.C., Chai, J.: Grounded semantic role labeling. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 149–159 (2016)

  45. Yatskar, M., Zettlemoyer, L., Farhadi, A.: Situation recognition: Visual semantic role labeling for image understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5534–5542 (2016)

  46. Ziem, A.: Do we really need a multimodal construction grammar? Linguist. Vanguard 3(s1) (2017)

Author information

Corresponding author

Correspondence to Christopher Tam.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tam, C., Brutti, R., Lai, K., Pustejovsky, J. (2023). Evaluating Multimodal Behavior Schemas with VoxWorld. In: Duffy, V.G. (eds) Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. HCII 2023. Lecture Notes in Computer Science, vol 14028. Springer, Cham. https://doi.org/10.1007/978-3-031-35741-1_41

  • DOI: https://doi.org/10.1007/978-3-031-35741-1_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-35740-4

  • Online ISBN: 978-3-031-35741-1

  • eBook Packages: Computer Science, Computer Science (R0)
