
Embodied Human Computer Interaction

  • Technical Contribution
  • Published in: KI - Künstliche Intelligenz

Abstract

In this paper, we argue that embodiment can play an important role in the design and modeling of systems developed for Human Computer Interaction. To this end, we describe a simulation platform for building Embodied Human Computer Interactions (EHCI). This platform, VoxWorld, enables multimodal dialogue systems that communicate through language, gesture, action, facial expressions, and gaze tracking, in the context of task-oriented interactions. A multimodal simulation is an embodied 3D virtual realization of both the situational environment and the co-situated agents, as well as the most salient content denoted by communicative acts in a discourse. It is built on the modeling language VoxML (Pustejovsky and Krishnaswamy, VoxML: a visualization modeling language, Proceedings of LREC, 2016), which encodes objects with rich semantic typing and action affordances, and actions themselves as multimodal programs, enabling contextually salient inferences and decisions in the environment. VoxWorld enables embodied HCI by situating both human and artificial agents within the same virtual simulation environment, where they share perceptual and epistemic common ground. We discuss the formal and computational underpinnings of embodiment and common ground, how they interact and specify parameters of the interaction between humans and artificial agents, and we demonstrate behaviors and types of interactions on different classes of artificial agents.
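To give a concrete flavor of the encoding, here is a minimal, hypothetical Python sketch of a VoxML-style voxeme; the field names loosely mirror the LEX/TYPE/HABITAT/AFFORD_STR structure described in the paper, but this is illustrative only, not the actual VoxML schema:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Voxeme:
        lex: str            # lexical predicate, e.g., "cup"
        sem_type: str       # semantic head type, e.g., "physobj"
        concavity: str      # geometric property used in inference
        habitats: List[str] = field(default_factory=list)     # configurations enabling actions
        affordances: List[str] = field(default_factory=list)  # Gibsonian/telic affordances

    cup = Voxeme(
        lex="cup",
        sem_type="physobj",
        concavity="concave",
        habitats=["upright(y-axis)"],
        affordances=["grasp(agent, this)", "contain(this, x)"],
    )

    # An agent can filter affordances to find contextually available actions:
    print([a for a in cup.affordances if a.startswith("grasp")])

On this picture, an agent in VoxWorld can filter an object's affordances against its current habitat to determine which actions are available in context.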



Notes

  1. This recalls the question of how to best model situated action [16, 97].

  2. See Sect. 5 for details on integrating various sensor types and their relationships with the particulars of the artificial agent’s embodiment.

  3. as = argument structure; qs = qualia structure.

  4. Beginning in [52], voxemes have been denoted [[voxeme]].

  5. It should be noted that Gibsonian affordances might be construed as the goal of an activity in some contexts.

  6. TTR encodes actions (such as put and grasp above) as finite-state sequences of subevents (cf. [72]), but the computational effect of applying the updating functions to the current RobotState, given an action, is similar to our interpretation of events as state-transformers, i.e., mappings from RobotState to RobotState.
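     As a minimal illustration of the state-transformer reading (a sketch only: the RobotState fields and the grasp/put updates below are hypothetical placeholders, not the paper's implementation), an action denotes a function from RobotState to RobotState, and a complex event is the composition of its subevents' transformers:

         from dataclasses import dataclass, replace
         from functools import reduce
         from typing import Callable, Dict, Optional, Tuple

         # Hypothetical RobotState; fields are placeholders for illustration.
         @dataclass(frozen=True)
         class RobotState:
             holding: Optional[str]
             positions: Dict[str, Tuple[int, int]]

         Action = Callable[[RobotState], RobotState]

         def grasp(obj: str) -> Action:
             # Subevent: the agent ends up holding obj.
             return lambda s: replace(s, holding=obj)

         def put(obj: str, pos: Tuple[int, int]) -> Action:
             # Subevent: obj is released at pos.
             return lambda s: replace(s, holding=None,
                                      positions={**s.positions, obj: pos})

         def seq(*subevents: Action) -> Action:
             # An event as a finite-state sequence of subevents:
             # the composition of their state-transformers.
             return lambda s: reduce(lambda st, a: a(st), subevents, s)

         s0 = RobotState(holding=None, positions={"block1": (0, 0)})
         move = seq(grasp("block1"), put("block1", (0, 1)))
         print(move(s0))  # holding=None, positions={'block1': (0, 1)}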

  7. VoxSim source can be found here.

  8. Shared aural perception is possible, while haptic technology is rapidly advancing. We expect that much of the semantics presented here would be suitable for modeling extra-visual shared perception. This is the topic of ongoing research, beginning with haptics in VR.

  9. This is similar in many respects to the representations introduced in [20, 27] and [37] for modeling action and control with robots.

  10. The theory of semiotic schemas introduced in [83] attempts to encode the perceptual context of a linguistic utterance as well, to resolve reference.

  11. Forward kinematics computes the position of the end-effector from the joint parameters. Inverse kinematics computes the joint parameters from a desired position of the end-effector.
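     As a worked example (a sketch for a hypothetical two-link planar arm with link lengths l1 and l2, not tied to any particular agent in the paper), the two computations invert each other:

         import math

         def forward(l1, l2, t1, t2):
             """Forward kinematics: joint angles -> end-effector position."""
             x = l1 * math.cos(t1) + l2 * math.cos(t1 + t2)
             y = l1 * math.sin(t1) + l2 * math.sin(t1 + t2)
             return x, y

         def inverse(l1, l2, x, y):
             """Inverse kinematics: end-effector position -> joint angles
             (one of the two solutions, assuming the target is reachable)."""
             c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
             t2 = math.acos(c2)
             t1 = math.atan2(y, x) - math.atan2(l2 * math.sin(t2),
                                                l1 + l2 * math.cos(t2))
             return t1, t2

         t1, t2 = inverse(1.0, 1.0, 1.2, 0.8)
         print(forward(1.0, 1.0, t1, t2))  # ~(1.2, 0.8)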

  12. \([\![S ]\!]= ([\![\mathbf{NP} ]\!][\![\mathbf{GP} ]\!]).\)

  13. \([\![\mathbf{GP}_1 ]\!]= \lambda j. ([\![\mathbf{D}_{Obj} ]\!];\lambda j'.(([\![\mathbf{G}_{af} ]\!]j')j)).\)

  14. \([\![\mathbf{GP}_2 ]\!]= \lambda k. ([\![\mathbf{D}_{Loc} ]\!]; \lambda j. ([\![\mathbf{D}_{Obj} ]\!];\lambda j'.(([\![\mathbf{G}_{af} ]\!]j')j)k)).\)

  15. \([\![\mathbf{GP}_3 ]\!]= \lambda k. ([\![\mathbf{D}_{Dir} ]\!]; \lambda j. ([\![\mathbf{D}_{Obj} ]\!];\lambda j'.(([\![\mathbf{G}_{af} ]\!]j')j)k)).\)
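     To unpack the continuation-passing structure in note 13 (a sketch with hypothetical denotations: d_obj stands in for \([\![\mathbf{D}_{Obj} ]\!]\), g_af for \([\![\mathbf{G}_{af} ]\!]\), and the strings are placeholder meanings), the gesture phrase threads the deictically resolved object through the action gesture before handing the result to the discourse continuation j:

         # [[GP1]] = λj. ([[D_Obj]]; λj'. (([[G_af]] j') j))

         def d_obj(k):
             # Object deixis: resolve the indicated object and pass it
             # to the continuation k (the ";" in the formula).
             return k("block1")

         def g_af(obj):
             # Action gesture: given the object, return a computation
             # awaiting the discourse continuation j.
             return lambda j: j("grasp({})".format(obj))

         def gp1(j):
             # Thread D_Obj's value j' through G_af, then apply to j.
             return d_obj(lambda j_prime: g_af(j_prime)(j))

         print(gp1(lambda act: act))  # -> "grasp(block1)"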

  16. \([\alpha ]_{\sigma } (x_i \vee e_i)\), \([\beta ]_{\sigma } (x_i \vee e_i).\)

  17. \([\alpha ]_{\sigma } ([\beta ]_{\sigma } (x_i \vee e_i))\), \([\beta ]_{\sigma } ([\alpha ]_{\sigma } (x_i \vee e_i)).\)

  18. \([\beta ]_{\sigma } ([\alpha ]_{\sigma } ([\beta ]_{\sigma } (x_i \vee e_i))) \), \([\alpha ]_{\sigma } ([\beta ]_{\sigma } ([\alpha ]_{\sigma } (x_i \vee e_i))).\)

  19. \([(\alpha \cup \beta )^*]_{\sigma } \varphi. \)
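     (These modalities behave as in standard dynamic logic [43, 44], which we assume the \(\sigma \)-indexed notation inherits: choice and iteration unfold as \([\alpha \cup \beta ]_{\sigma } \varphi \leftrightarrow [\alpha ]_{\sigma } \varphi \wedge [\beta ]_{\sigma } \varphi \) and \([(\alpha \cup \beta )^*]_{\sigma } \varphi \leftrightarrow \varphi \wedge [\alpha \cup \beta ]_{\sigma } [(\alpha \cup \beta )^*]_{\sigma } \varphi \), so note 19 asserts \(\varphi \) after every finite interleaving of \(\alpha \) and \(\beta \), subsuming the bounded alternations of notes 16–18.)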

  20. A video demo can be viewed at http://www.voxicon.net/wp-content/uploads/2020/07/DARPA-CwC-Brandeis-CSU-July-2020.mp4.

  21. VoxML encodes relations using a number of common spatial reasoning calculi, including the Region Connection Calculus [82], where this would be encoded as EC(y, sfc).
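     As a hedged illustration of the RCC8 relation set (a toy geometric stand-in that classifies two closed discs in the plane; this is not VoxML's actual encoding):

         import math

         # The eight RCC8 base relations [82].
         RCC8 = ["DC", "EC", "PO", "TPP", "NTPP", "TPPi", "NTPPi", "EQ"]

         def rcc_discs(c1, r1, c2, r2, eps=1e-9):
             """Toy classifier: RCC8 relation between two closed discs."""
             d = math.dist(c1, c2)
             if d > r1 + r2 + eps:
                 return "DC"     # disconnected
             if abs(d - (r1 + r2)) <= eps:
                 return "EC"     # externally connected: boundaries touch only
             if d <= eps and abs(r1 - r2) <= eps:
                 return "EQ"     # identical regions
             if abs(d + r1 - r2) <= eps:
                 return "TPP"    # tangential proper part
             if abs(d + r2 - r1) <= eps:
                 return "TPPi"
             if d + r1 < r2 - eps:
                 return "NTPP"   # non-tangential proper part
             if d + r2 < r1 - eps:
                 return "NTPPi"
             return "PO"         # partial overlap

         # e.g., a cup touching a surface: boundaries meet, interiors disjoint.
         print(rcc_discs((0, 0), 1.0, (2, 0), 1.0))  # -> "EC"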

References

  1. Anderson ML (2003) Embodied cognition: a field guide. Artif Intell 149(1):91–130

  2. Asher N (1998) Common ground, corrections and coordination. J Semant

  3. Asher N (2008) A type driven theory of predication with complex types. Fundam Inform 84(2):151–183

  4. Asher N, Lascarides A (2003) Logics of conversation. Cambridge University Press, Cambridge

  5. Asher N, Pogodalla S (2010) SDRT and continuation semantics. In: JSAI international symposium on artificial intelligence, Springer, New York, pp 3–15

  6. Asher N, Pustejovsky J (2006) A type composition logic for Generative Lexicon. J Cognit Sci 6:1–38

  7. Baker CL, Jara-Ettinger J, Saxe R, Tenenbaum JB (2017) Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nat Hum Behav 1(4):1–10

  8. Ballard DH (1981) Generalizing the Hough transform to detect arbitrary shapes. Pattern Recogn 13(2):111–122

  9. Barker C, Shan CC (2014) Continuations and natural language. Oxford studies in theoretical linguistics, vol 53. Oxford University Press, Oxford

  10. van Benthem JFAK (1991) Logic and the flow of information

  11. Bergen BK (2012) Louder than words: the new science of how the mind makes meaning. Basic Books, New York

  12. Blackburn P, Bos J (2003) Computational semantics. Theoria Int J Theory Hist Found Sci, pp 27–45

  13. Cassell J, Stone M, Yan H (2000a) Coordination and context-dependence in the generation of embodied conversation. In: Proceedings of the first international conference on natural language generation, vol 14. ACL, pp 171–178

  14. Cassell J, Sullivan J, Churchill E, Prevost S (2000b) Embodied conversational agents. MIT Press, Cambridge

  15. Chrisley R (2003) Embodied artificial intelligence. Artif Intell 149(1):131–150

  16. Clancey WJ (1993) Situated action: a neuropsychological interpretation (response to Vera and Simon). Cogn Sci 17(1):87–116

  17. Clark HH, Brennan SE (1991) Grounding in communication. Perspect Soc Shared Cognit 13(1991):127–149

  18. Cooper R (2005) Records and record types in semantic theory. J Logic Comput 15(2):99–112

  19. Cooper R (2017) Adapting type theory with records for natural language semantics. In: Modern perspectives in type-theoretical semantics, Springer, New York, pp 71–94

  20. Cooper R, Ginzburg J (2015) Type theory with records for natural language semantics. In: The handbook of contemporary semantic theory, p 375

  21. Coventry K, Garrod SC (2005) Spatial prepositions and the functional geometric framework: towards a classification of extra-geometric influences

  22. Craik KJW (1943) The nature of explanation. Cambridge University Press, Cambridge

  23. De Groote P (2001) Type raising, continuations, and classical logic. In: Proceedings of the thirteenth Amsterdam Colloquium, pp 97–101

  24. Dekker PJ (2012) Predicate logic with anaphora. In: Dynamic semantics, Springer, New York, pp 7–47

  25. Dobnik S, Cooper R (2017) Interfacing language, spatial perception and cognition in type theory with records. J Lang Modell 5(2):273–301

  26. Dobnik S, Cooper R, Larsson S (2012) Modelling language, action, and perception in type theory with records. In: International workshop on constraint solving and language processing, Springer, New York, pp 70–91

  27. Dobnik S, Cooper R, Larsson S (2013) Modelling language, action, and perception in type theory with records. In: Constraint solving and language processing, Springer, New York, pp 70–91

  28. Evans V (2013) Language and time: a cognitive linguistics approach. Cambridge University Press, Cambridge

  29. Feldman J (2010) Embodied language, best-fit analysis, and formal compositionality. Phys Life Rev 7(4):385–410

  30. Fernando T (2009) Situations in LTL as strings. Inf Comput 207(10):980–999

  31. Fischer K (2011) How people talk with robots: designing dialog to reduce user uncertainty. AI Mag 32(4):31–38

  32. Foster ME (2007) Enhancing human–computer interaction with embodied conversational agents. In: International conference on universal access in human–computer interaction, Springer, New York, pp 828–837

  33. Gatsoulis Y, Alomari M, Burbridge C, Dondrup C, Duckworth P, Lightbody P, Hanheide M, Hawes N, Hogg D, Cohn A, et al. (2016) QSRlib: a software library for online acquisition of qualitative spatial relations from video

  34. Gibson JJ (1977) The theory of affordances. In: Perceiving, acting, and knowing: toward an ecological psychology, pp 67–82

  35. Gibson JJ (1979) The ecological approach to visual perception. Psychology Press

  36. Ginzburg J (1996) Interrogatives: questions, facts and dialogue. In: The handbook of contemporary semantic theory. Blackwell, Oxford, pp 359–423

  37. Ginzburg J, Fernández R (2010) Computational models of dialogue. The handbook of computational linguistics and natural language processing 57:1

  38. Goldman AI (1989) Interpretation psychologized. Mind Lang 4(3):161–185

  39. Gordon RM (1986) Folk psychology as simulation. Mind Lang 1(2):158–171

  40. Gregoromichelaki E, Kempson R, Howes C (2020) Actionism in syntax and semantics. Dial Percept, pp 12–27

  41. Griffiths TL, Chater N, Kemp C, Perfors A, Tenenbaum JB (2010) Probabilistic models of cognition: exploring representations and inductive biases. Trends Cogn Sci 14(8):357–364

  42. Groenendijk J, Stokhof M (1991) Dynamic predicate logic. Linguist Philos 14(1):39–100

  43. Harel D (1984) Dynamic logic. In: Gabbay D, Guenthner F (eds) Handbook of philosophical logic, volume II: extensions of classical logic. Reidel, pp 497–604

  44. Harel D, Kozen D, Tiuryn J (2000) Dynamic logic, 1st edn. MIT Press, Cambridge

  45. Johnson M (1987) The body in the mind: the bodily basis of meaning, imagination, and reason. University of Chicago Press, Chicago

  46. Kamp H, Van Genabith J, Reyle U (2011) Discourse representation theory. In: Handbook of philosophical logic, Springer, New York, pp 125–394

  47. Kendon A (2004) Gesture: visible action as utterance. Cambridge University Press, Cambridge

  48. Kiela D, Bulat L, Vero AL, Clark S (2016) Virtual embodiment: a scalable long-term strategy for artificial intelligence research. arXiv preprint arXiv:1610.07432

  49. Klein E, Sag IA (1985) Type-driven translation. Linguist Philos 8(2):163–201

  50. Konrad K (2004) Minimal model generation. In: Model generation for natural language interpretation and analysis, Springer, New York, pp 55–56

  51. Kopp S, Wachsmuth I (2010) Gesture in embodied communication and human–computer interaction, vol 5934. Springer, New York

  52. Krishnaswamy N (2017) Monte-Carlo simulation generation through operationalization of spatial primitives. PhD thesis, Brandeis University

  53. Krishnaswamy N, Pustejovsky J (2016a) Multimodal semantic simulations of linguistically underspecified motion events. In: Spatial cognition X, Springer, New York, pp 177–197

  54. Krishnaswamy N, Pustejovsky J (2016b) VoxSim: a visual platform for modeling motion language. In: Proceedings of COLING 2016, the 26th international conference on computational linguistics, ACL

  55. Krishnaswamy N, Pustejovsky J (2018) Deictic adaptation in a virtual environment. In: Spatial cognition XI, Springer, New York, pp 180–196

  56. Krishnaswamy N, Narayana P, Wang I, Rim K, Bangar R, Patil D, Mulay G, Ruiz J, Beveridge R, Draper B, Pustejovsky J (2017) Communicating and acting: understanding gesture in simulation semantics. In: 12th international workshop on computational semantics

  57. Kruijff GJM, Lison P, Benjamin T, Jacobsson H, Zender H, Kruijff-Korbayová I, Hawes N (2010) Situated dialogue processing for human–robot interaction. In: Cognitive systems, Springer, pp 311–364

  58. Landragin F (2006) Visual perception, language and gesture: a model for their understanding in multimodal dialogue systems. Signal Process 86(12):3578–3595

  59. Lascarides A, Stone M (2006) Formal semantics for iconic gesture. In: Proceedings of the 10th workshop on the semantics and pragmatics of dialogue (BRANDIAL), pp 64–71

  60. Lascarides A, Stone M (2009) A formal semantic analysis of gesture. J Semant 26(4):393–449

  61. Lücking A, Pfeiffer T, Rieser H (2015) Pointing and reference reconsidered. J Pragmat 77:56–79

  62. Mani I, Pustejovsky J (2012) Interpreting motion: grounded representations for spatial language. Oxford University Press, Oxford

  63. Marge M, Rudnicky AI (2013) Towards evaluating recovery strategies for situated grounding problems in human–robot dialogue. In: 2013 IEEE RO-MAN, IEEE, pp 340–341

  64. Marshall P, Hornecker E (2013) Theories of embodiment in HCI. SAGE Handb Digit Technol Res 1:144–158

  65. McNeely-White DG, Ortega FR, Beveridge JR, Draper BA, Bangar R, Patil D, Pustejovsky J, Krishnaswamy N, Rim K, Ruiz J, Wang I (2019) User-aware shared perception for embodied agents. In: 2019 IEEE international conference on humanized computing and communication (HCC), IEEE, pp 46–51

  66. Miller GA, Johnson-Laird PN (1976) Language and perception. Belknap Press, Cambridge

  67. Muller P, Prévot L (2009) Grounding information in route explanation dialogues

  68. Narayana P, Krishnaswamy N, Wang I, Bangar R, Patil D, Mulay G, Rim K, Beveridge R, Ruiz J, Pustejovsky J, Draper B (2018) Cooperating with avatars through gesture, language and action. In: Intelligent systems conference (IntelliSys)

  69. Narayanan S (2010) Mind changes: a simulation semantics account of counterfactuals. Cogn Sci

  70. Naumann R (2001) Aspects of changes: a dynamic event semantics. J Semant 18:27–81

  71. Plaza J (2007) Logics of public communications. Synthese 158(2):165–179

  72. Pustejovsky J (1991) The syntax of event structure. Cognition 41(1–3):47–81

  73. Pustejovsky J (1995) The Generative Lexicon. MIT Press, Cambridge

  74. Pustejovsky J (2013) Dynamic event structure and habitat theory. In: Proceedings of the 6th international conference on Generative Approaches to the Lexicon (GL2013), ACL, pp 1–10

  75. Pustejovsky J (2018) From actions to events: communicating through language and gesture. Interact Stud 19(1–2):289–317

  76. Pustejovsky J, Batiukova O (2019) The lexicon. Cambridge University Press, Cambridge

  77. Pustejovsky J, Boguraev B (1993) Lexical knowledge representation and natural language processing. Artif Intell 63(1–2):193–223

  78. Pustejovsky J, Krishnaswamy N (2016) VoxML: a visualization modeling language. In: Proceedings of LREC

  79. Pustejovsky J, Krishnaswamy N (2020) Embodied human–computer interactions through situated grounding. In: IVA '20: proceedings of the 20th international conference on intelligent virtual agents, ACM

  80. Pustejovsky J, Moszkowicz JL (2011) The qualitative spatial dynamics of motion in language. Spatial Cognit Comput 11(1):15–44

  81. Qing C, Goodman ND, Lassiter D (2016) A rational speech-act model of projective content. In: Proceedings of cognitive science, pp 1110–1115

  82. Randell D, Cui Z, Cohn A (1992) A spatial logic based on regions and connection. In: Nebel B, Rich C, Swartout W (eds) KR'92: principles of knowledge representation and reasoning, proceedings of the 3rd international conference. Morgan Kaufmann, San Mateo, pp 165–176

  83. Roy D (2005) Semiotic schemas: a framework for grounding language in action and perception. Artif Intell 167(1–2):170–205

  84. Schaffer S, Reithinger N (2019) Conversation is multimodal: thus conversational user interfaces should be as well. In: Proceedings of the 1st international conference on conversational user interfaces, pp 1–3

  85. Scheutz M, Cantrell R, Schermerhorn P (2011) Toward humanlike task-based dialogue processing for human robot interaction. AI Mag 32(4):77–84

  86. Schlenker P (2020) Gestural grammar. Nat Lang Linguist Theory, pp 1–50

  87. Shapiro L (2014) The Routledge handbook of embodied cognition. Routledge, London

  88. Stalnaker R (2002) Common ground. Linguist Philos 25(5–6):701–721

  89. Tavares JMRS, Padilha AJMN (1995) A new approach for merging edge line segments. In: Proceedings RecPad'95, Aveiro

  90. Tellex S, Gopalan N, Kress-Gazit H, Matuszek C (2020) Robots that use language. Annu Rev Control Robot Auton Syst 3:25–55

  91. Tomasello M, Carpenter M (2007) Shared intentionality. Dev Sci 10(1):121–125

  92. Ullman TD, Goodman ND, Tenenbaum JB (2012) Theory learning as stochastic search in the language of thought. Cogn Dev 27(4):455–480

  93. Unger C (2011) Dynamic semantics as monadic computation. In: JSAI international symposium on artificial intelligence, Springer, New York, pp 68–81

  94. Van Benthem J (2011) Logical dynamics of information and interaction. Cambridge University Press, Cambridge

  95. Van Ditmarsch H, van der Hoek W, Kooi B (2007) Dynamic epistemic logic, vol 337. Springer, New York

  96. Van Eijck J, Unger C (2010) Computational semantics with functional programming. Cambridge University Press, Cambridge

  97. Vera AH, Simon HA (1993) Situated action: a symbolic interpretation. Cogn Sci 17(1):7–48. https://doi.org/10.1016/S0364-0213(05)80008-4

  98. Wahlster W (2006) Dialogue systems go multimodal: the SmartKom experience. In: SmartKom: foundations of multimodal dialogue systems, Springer, New York, pp 3–27

  99. Wang I, Narayana P, Patil D, Mulay G, Bangar R, Draper B, Beveridge R, Ruiz J (2017) EGGNOG: a continuous, multi-modal data set of naturally occurring gestures with ground truth labels. In: Proceedings of the 12th IEEE international conference on automatic face & gesture recognition

  100. Weiser M (1999) The computer for the 21st century. ACM SIGMOBILE Mob Comput Commun Rev 3(3):3–11

  101. Williams T, Bussing M, Cabrol S, Boyle E, Tran N (2019) Mixed reality deictic gesture for multi-modal robot communication. In: 2019 14th ACM/IEEE international conference on human–robot interaction (HRI), IEEE, pp 191–201

  102. Winston ME, Chaffin R, Herrmann D (1987) A taxonomy of part-whole relations. Cogn Sci 11(4):417–444


Acknowledgements

We would like to thank Ross Beveridge, Bruce Draper, Francisco R. Ortega, and their team at Colorado State University, and Jaime Ruiz and his team at the University of Florida, without whose contribution the Diana System would not be a reality. We would also like to thank Katherine Krajovic, R. Pito Salas, and Nathaniel J. Dimick for their work on the Kirby implementation. Particular thanks to Ms. Krajovic for assembling the dialogue flowcharts in Fig. 14. We would also like to thank Ken Lai for his discussion regarding common ground structure. This work was supported by the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) under contract #W911NF-15-C-0238 at Brandeis University. This work was also supported in part by a grant to James Pustejovsky from the IIS Division of National Science Foundation (1763926) entitled “Building a Uniform Meaning Representation for Natural Language Processing”. The points of view expressed herein are solely those of the authors and do not represent the views of the Department of Defense or the United States Government. Any errors or omissions are, of course, the responsibility of the authors.

Author information

Correspondence to James Pustejovsky.

Additional information

This work was supported by the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) under contract #W911NF-15-C-0238 at Brandeis University. It was first presented in [79], on which this discussion is based.


Cite this article

Pustejovsky, J., Krishnaswamy, N. Embodied Human Computer Interaction. Künstl Intell 35, 307–327 (2021). https://doi.org/10.1007/s13218-021-00727-5
