Abstract
In this paper, we argue that embodiment can play an important role in the design and modeling of systems developed for Human Computer Interaction. To this end, we describe a simulation platform for building Embodied Human Computer Interactions (EHCI). This system, VoxWorld, enables multimodal dialogue systems that communicate through language, gesture, action, facial expressions, and gaze tracking, in the context of task-oriented interactions. A multimodal simulation is an embodied 3D virtual realization of both the situational environment and the co-situated agents, as well as the most salient content denoted by communicative acts in a discourse. It is built on the modeling language VoxML (Pustejovsky and Krishnaswamy in VoxML: a visualization modeling language, proceedings of LREC, 2016), which encodes objects with rich semantic typing and action affordances, and actions themselves as multimodal programs, enabling contextually salient inferences and decisions in the environment. VoxWorld enables an embodied HCI by situating both human and artificial agents within the same virtual simulation environment, where they share perceptual and epistemic common ground. We discuss the formal and computational underpinnings of embodiment and common ground, how they interact and specify parameters of the interaction between humans and artificial agents, and demonstrate behaviors and types of interactions on different classes of artificial agents.
Notes
See Sect. 5 for details on integrating various sensor types and their relationships with the particulars of the artificial agent’s embodiment.
as = argument structure; qs = qualia structure.
Beginning in [52], voxemes have been denoted [[voxeme]].
It should be noted that Gibsonian affordances might be construed as the goal of an activity in some contexts.
TTR encodes actions (such as put and grasp above) as finite-state sequences of subevents (cf. [72]), but the computational effect of applying the updating functions to the current RobotState, given an action, is similar to our interpretation of events as state-transformers, i.e., as mappings from RobotState to RobotState.
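To make the state-transformer reading concrete, the following minimal Python sketch treats each action as a function from RobotState to RobotState and a complex event as the left-to-right composition of its subevents. The RobotState fields and the grasp/put bodies are invented for illustration and are not the system's actual representation.

from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple

# Hypothetical, simplified robot state; the fields are illustrative only.
@dataclass(frozen=True)
class RobotState:
    gripper_pos: Tuple[float, float, float]
    holding: Optional[str] = None

# An action denotes a state-transformer: RobotState -> RobotState.
Action = Callable[[RobotState], RobotState]

def grasp(obj: str) -> Action:
    def run(s: RobotState) -> RobotState:
        return replace(s, holding=obj)
    return run

def put(target: Tuple[float, float, float]) -> Action:
    def run(s: RobotState) -> RobotState:
        return replace(s, gripper_pos=target, holding=None)
    return run

def seq(*subevents: Action) -> Action:
    # A complex event applies its subevents in order, threading the state through.
    def run(s: RobotState) -> RobotState:
        for a in subevents:
            s = a(s)
        return s
    return run

# "grasp the block, then put it at (1, 0, 0)" as a single state-transformer
move_block = seq(grasp("block1"), put((1.0, 0.0, 0.0)))
print(move_block(RobotState(gripper_pos=(0.0, 0.0, 0.0))))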
VoxSim source can be found here.
Shared aural perception is possible, while haptic technology is rapidly advancing. We expect that much of the semantics presented here would be suitable for modeling extra-visual shared perception. This is the topic of ongoing research, beginning with haptics in VR.
The theory of semiotic schemas introduced in [83] attempts to encode the perceptual context of a linguistic utterance as well, to resolve reference.
Forward kinematics computes the position of the end-effector from the joint parameters. Inverse kinematics computes the joint parameters from the position of the end-effector.
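For concreteness, a minimal Python sketch of both directions for a hypothetical two-link planar arm is given below; the link lengths and the analytic "elbow-down" solution are illustrative assumptions, not the kinematics of any agent discussed here.

import math

# Forward kinematics: joint angles -> end-effector position (2-link planar arm).
def forward(theta1, theta2, l1=1.0, l2=1.0):
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

# Inverse kinematics: end-effector position -> joint angles ("elbow-down" solution).
def inverse(x, y, l1=1.0, l2=1.0):
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    theta2 = math.acos(max(-1.0, min(1.0, c2)))
    theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2), l1 + l2 * math.cos(theta2))
    return theta1, theta2

# Round trip: the angles recovered for a reachable target reproduce that target.
t1, t2 = inverse(1.2, 0.8)
print(forward(t1, t2))  # approximately (1.2, 0.8)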
\([\![S ]\!]= ([\![\mathbf{NP} ]\!][\![\mathbf{GP} ]\!]).\)
\([\![\mathbf{GP}_1 ]\!]= \lambda j. ([\![\mathbf{D}_{Obj} ]\!];\lambda j'.(([\![\mathbf{G}_{af} ]\!]j')j)).\)
\([\![\mathbf{GP}_2 ]\!]= \lambda k. ([\![\mathbf{D}_{Loc} ]\!]; \lambda j. ([\![\mathbf{D}_{Obj} ]\!];\lambda j'.(([\![\mathbf{G}_{af} ]\!]j')j)k)).\)
\([\![\mathbf{GP}_3 ]\!]= \lambda k. ([\![\mathbf{D}_{Dir} ]\!]; \lambda j. ([\![\mathbf{D}_{Obj} ]\!];\lambda j'.(([\![\mathbf{G}_{af} ]\!]j')j)k)).\)
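One way to read the sequencing operator ";" in these formulas is in continuation-passing style: object deixis passes the indicated object to its continuation, and the affordance gesture consumes that object together with the phrase-level continuation j. The following Python sketch of \([\![\mathbf{GP}_1 ]\!]\) assumes only that reading; the names d_obj and g_af and their return values are invented for illustration.

# Object deixis: resolve the pointed-at object and hand it to the continuation k.
def d_obj(k):
    indicated = "block1"  # assume the pointing gesture resolves to this object
    return k(indicated)

# Affordance gesture (e.g., a grasp shape): an action over the indicated object,
# still awaiting the phrase-level continuation j.
def g_af(obj):
    return lambda j: j(("grasp", obj))

# [[GP1]] = lambda j. ([[D_Obj]] ; lambda j'. (([[G_af]] j') j))
gp1 = lambda j: d_obj(lambda j_prime: g_af(j_prime)(j))

print(gp1(lambda act: f"perform {act}"))  # perform ('grasp', 'block1')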
\([\alpha ]_{\sigma } (x_i \vee e_i)\), \([\beta ]_{\sigma } (x_i \vee e_i).\)
\([\alpha ]_{\sigma } ([\beta ]_{\sigma } (x_i \vee e_i))\), \([\beta ]_{\sigma } ([\alpha ]_{\sigma } (x_i \vee e_i)).\)
\([\beta ]_{\sigma } ([\alpha ]_{\sigma } ([\beta ]_{\sigma } (x_i \vee e_i))) \), \([\alpha ]_{\sigma } ([\beta ]_{\sigma } ([\alpha ]_{\sigma } (x_i \vee e_i))).\)
\([(\alpha \cup \beta )^*]_{\sigma } \varphi. \)
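Read over a toy Kripke-style model (and suppressing the \(\sigma\) parameter), the last formula says that \(\varphi\) must hold after every finite iteration of choosing \(\alpha\) or \(\beta\). The Python sketch below checks this on a small invented transition system; the states, relations, and valuation are purely illustrative.

from itertools import product

states = {0, 1, 2, 3}
alpha = {(0, 1), (1, 2)}   # transitions for program alpha
beta = {(0, 2), (2, 3)}    # transitions for program beta
phi = {2, 3}               # states at which phi holds

def star(rel):
    # Reflexive-transitive closure of a relation (Kleene star).
    closure = {(s, s) for s in states} | set(rel)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(closure, closure) if b == c}
        if new <= closure:
            return closure
        closure |= new

def box(rel, prop):
    # [rel] prop: states all of whose rel-successors satisfy prop.
    return {s for s in states if all(t in prop for (u, t) in rel if u == s)}

# States satisfying [(alpha ∪ beta)*] phi: every finite alpha/beta path stays in phi.
print(box(star(alpha | beta), phi))  # {2, 3}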
A video demo can be viewed at http://www.voxicon.net/wp-content/uploads/2020/07/DARPA-CwC-Brandeis-CSU-July-2020.mp4.
VoxML encodes relations using a number of common spatial reasoning calculi, including the Region Connection Calculus [82], in which this would be encoded as EC(y, sfc).
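As a simplified illustration of the relation (here in one dimension, rather than the 3D tests VoxWorld actually performs), EC holds when two regions share a boundary point but their interiors do not overlap:

# RCC-style checks over closed 1D intervals; purely illustrative, not VoxWorld's tests.
def dc(a, b):  # disconnected: no shared points
    return a[1] < b[0] or b[1] < a[0]

def ec(a, b):  # externally connected: boundaries touch, interiors do not overlap
    return a[1] == b[0] or b[1] == a[0]

def po(a, b):  # partial overlap: interiors overlap, neither contains the other
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    return (overlap > 0
            and not (b[0] <= a[0] and a[1] <= b[1])
            and not (a[0] <= b[0] and b[1] <= a[1]))

# A block y resting on a table surface sfc touches it without overlap: EC(y, sfc).
y, sfc = (1.0, 2.0), (0.0, 1.0)  # heights occupied by the block and the table top
print(dc(y, sfc), ec(y, sfc), po(y, sfc))  # False True False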
References
Anderson ML (2003) Embodied cognition: a field guide. Artif Intell 149(1):91–130
Asher N (1998) Common ground, corrections and coordination. J Semant
Asher N (2008) A type driven theory of predication with complex types. Fund Inf 84(2):151–183
Asher N, Lascarides A (2003) Logics of conversation. Cambridge University Press, Cambridge
Asher N, Pogodalla S (2010) SDRT and continuation semantics. In: JSAI international symposium on artificial intelligence, Springer, New York, pp 3–15
Asher N, Pustejovsky J (2006) A type composition logic for generative lexicon. J Cognit Sci 6:1–38
Baker CL, Jara-Ettinger J, Saxe R, Tenenbaum JB (2017) Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nat Hum Behav 1(4):1–10
Ballard DH (1981) Generalizing the Hough transform to detect arbitrary shapes. Pattern Recogn 13(2):111–122
Barker C, Shan CC (2014) Continuations and natural language, vol 53. Oxford Studies in Theoretical Linguistics
van Benthem JFAK (1991) Logic and the flow of information
Bergen BK (2012) Louder than words: the new science of how the mind makes meaning. Basic Books
Blackburn P, Bos J (2003) Computational semantics. Theor Int J Theory Hist Found Sci pp 27–45
Cassell J, Stone M, Yan H (2000a) Coordination and context-dependence in the generation of embodied conversation. In: Proceedings of the first international conference on Natural language generation-Volume 14, ACL, pp 171–178
Cassell J, Sullivan J, Churchill E, Prevost S (2000b) Embodied conversational agents. MIT Press, Cambridge
Chrisley R (2003) Embodied artificial intelligence. Artif Intell 149(1):131–150
Clancey WJ (1993) Situated action: a neuropsychological interpretation (response to Vera and Simon). Cogn Sci 17(1):87–116
Clark HH, Brennan SE (1991) Grounding in communication. Perspect Soc Share Cognit 13(1991):127–149
Cooper R (2005) Records and record types in semantic theory. J Logic Comput 15(2):99–112
Cooper R (2017) Adapting type theory with records for natural language semantics. In: Modern perspectives in type-theoretical semantics, Springer, New York, pp 71–94
Cooper R, Ginzburg J (2015) Type theory with records for natural language semantics. The handbook of contemporary semantic theory p 375
Coventry K, Garrod SC (2005) Spatial prepositions and the functional geometric framework: towards a classification of extra-geometric influences
Craik KJW (1943) The nature of explanation. Cambridge University Press, Cambridge
De Groote P (2001) Type raising, continuations, and classical logic. In: Proceedings of the thirteenth Amsterdam Colloquium, pp 97–101
Dekker PJ (2012) Predicate logic with anaphora. In: Dynamic Semantics, Springer, New York, pp 7–47
Dobnik S, Cooper R (2017) Interfacing language, spatial perception and cognition in type theory with records. J Lang Modell 5(2):273–301
Dobnik S, Cooper R, Larsson S (2012) Modelling language, action, and perception in type theory with records. In: International workshop on constraint solving and language processing, Springer, New York, pp 70–91
Dobnik S, Cooper R, Larsson S (2013) Modelling language, action, and perception in type theory with records. In: Constraint solving and language processing, Springer, New York, pp 70–91
Evans V (2013) Language and time: a cognitive linguistics approach. Cambridge University Press, Cambridge
Feldman J (2010) Embodied language, best-fit analysis, and formal compositionality. Phys Life Rev 7(4):385–410
Fernando T (2009) Situations in LTL as strings. Inf Comput 207(10):980–999
Fischer K (2011) How people talk with robots: designing dialog to reduce user uncertainty. AI Mag 32(4):31–38
Foster ME (2007) Enhancing human–computer interaction with embodied conversational agents. In: International conference on universal access in human–computer interaction, Springer, New York, pp 828–837
Gatsoulis Y, Alomari M, Burbridge C, Dondrup C, Duckworth P, Lightbody P, Hanheide M, Hawes N, Hogg D, Cohn A, et al. (2016) QSRlib: a software library for online acquisition of qualitative spatial relations from video
Gibson JJ (1977) The theory of affordances. Perceiving, acting, and knowing: toward an ecological psychology, pp 67–82
Gibson JJ (1979) The ecological approach to visual perception. Psychology Press
Ginzburg J (1996) Interrogatives: questions, facts and dialogue. In: The handbook of contemporary semantic theory. Blackwell, Oxford, pp 359–423
Ginzburg J, Fernández R (2010) Computational models of dialogue. The handbook of computational linguistics and natural language processing 57:1
Goldman AI (1989) Interpretation psychologized. Mind Lang 4(3):161–185
Gordon RM (1986) Folk psychology as simulation. Mind Lang 1(2):158–171
Gregoromichelaki E, Kempson R, Howes C (2020) Actionism in syntax and semantics. Dial Percept pp 12–27
Griffiths TL, Chater N, Kemp C, Perfors A, Tenenbaum JB (2010) Probabilistic models of cognition: exploring representations and inductive biases. Trends Cogn Sci 14(8):357–364
Groenendijk J, Stokhof M (1991) Dynamic predicate logic. Linguist Philos pp 39–100
Harel D (1984) Dynamic logic. In: Gabbay D, Guenthner F (eds) Handbook of philosophical logic, volume II: extensions of classical logic. Reidel, pp 497–604
Harel D, Kozen D, Tiuryn J (2000) Dynamic logic, 1st edn. MIT Press, Cambridge
Johnson M (1987) The body in the mind: the bodily basis of meaning, imagination, and reason. University of Chicago Press, Chicago
Kamp H, Van Genabith J, Reyle U (2011) Discourse representation theory. In: Handbook of philosophical logic, Springer, New York, pp 125–394
Kendon A (2004) Gesture: visible action as utterance. Cambridge University Press, Cambridge
Kiela D, Bulat L, Vero AL, Clark S (2016) Virtual embodiment: a scalable long-term strategy for artificial intelligence research. arXiv preprint arXiv:1610.07432
Klein E, Sag IA (1985) Type-driven translation. Linguist Philos 8(2):163–201
Konrad K (2004) Minimal model generation. In: Model generation for natural language interpretation and analysis, Springer, New York, pp 55–56
Kopp S, Wachsmuth I (2010) Gesture in embodied communication and human–computer interaction, vol 5934. Springer, New York
Krishnaswamy N (2017) Monte-carlo simulation generation through operationalization of spatial primitives. PhD thesis, Brandeis University
Krishnaswamy N, Pustejovsky J (2016a) Multimodal semantic simulations of linguistically underspecified motion events. In: Spatial Cognition X, Springer, New York, pp 177–197
Krishnaswamy N, Pustejovsky J (2016b) VoxSim: a visual platform for modeling motion language. In: Proceedings of COLING 2016, the 26th international conference on computational linguistics, ACL
Krishnaswamy N, Pustejovsky J (2018) Deictic adaptation in a virtual environment. In: Spatial cognition XI, Springer, New York, pp 180–196
Krishnaswamy N, Narayana P, Wang I, Rim K, Bangar R, Patil D, Mulay G, Ruiz J, Beveridge R, Draper B, Pustejovsky J (2017) Communicating and acting: Understanding gesture in simulation semantics. In: 12th International workshop on computational semantics
Kruijff GJM, Lison P, Benjamin T, Jacobsson H, Zender H, Kruijff-Korbayová I, Hawes N (2010) Situated dialogue processing for human–robot interaction. In: Cognitive systems, Springer, pp 311–364
Landragin F (2006) Visual perception, language and gesture: a model for their understanding in multimodal dialogue systems. Signal Process 86(12):3578–3595
Lascarides A, Stone M (2006) Formal semantics for iconic gesture. In: Proceedings of the 10th workshop on the semantics and pragmatics of dialogue (BRANDIAL), pp 64–71
Lascarides A, Stone M (2009) A formal semantic analysis of gesture. J Semant p ffp004
Lücking A, Pfeiffer T, Rieser H (2015) Pointing and reference reconsidered. J Pragmat 77:56–79
Mani I, Pustejovsky J (2012) Interpreting motion: grounded representations for spatial language. Oxford University Press, Oxford
Marge M, Rudnicky AI (2013) Towards evaluating recovery strategies for situated grounding problems in human–robot dialogue. In: 2013 IEEE RO-MAN, IEEE, pp 340–341
Marshall P, Hornecker E (2013) Theories of embodiment in HCI. SAGE Handb Digit Technol Res 1:144–158
McNeely-White DG, Ortega FR, Beveridge JR, Draper BA, Bangar R, Patil D, Pustejovsky J, Krishnaswamy N, Rim K, Ruiz J, Wang I (2019) User-aware shared perception for embodied agents. In: 2019 IEEE international conference on humanized computing and communication (HCC), IEEE, pp 46–51
Miller GA, Johnson-Laird PN (1976) Language and perception. Belknap Press, Cambridge
Muller P, Prévot L (2009) Grounding information in route explanation dialogues
Narayana P, Krishnaswamy N, Wang I, Bangar R, Patil D, Mulay G, Rim K, Beveridge R, Ruiz J, Pustejovsky J, Draper B (2018) Cooperating with avatars through gesture, language and action. In: Intelligent systems conference (IntelliSys)
Narayanan S (2010) Mind changes: a simulation semantics account of counterfactuals. Cognit Sci
Naumann R (2001) Aspects of changes: a dynamic event semantics. J Semant 18:27–81
Plaza J (2007) Logics of public communications. Synthese 158(2):165–179
Pustejovsky J (1991) The syntax of event structure. Cognition 41(1–3):47–81
Pustejovsky J (1995) The generative lexicon. MIT Press, Cambridge
Pustejovsky J (2013) Dynamic event structure and habitat theory. In: Proceedings of the 6th international conference on generative approaches to the Lexicon (GL2013), ACL, pp 1–10
Pustejovsky J (2018) From actions to events: communicating through language and gesture. Interact Stud 19(1–2):289–317
Pustejovsky J, Batiukova O (2019) The lexicon. Cambridge University Press, Cambridge
Pustejovsky J, Boguraev B (1993) Lexical knowledge representation and natural language processing. Artif Intell 63(1–2):193–223
Pustejovsky J, Krishnaswamy N (2016) VoxML: a visualization modeling language. In: Proceedings of LREC
Pustejovsky J, Krishnaswamy N (2020) Embodied human-computer interactions through situated grounding. In: IVA ’20: proceedings of the 20th international conference on intelligent virtual agents, ACM
Pustejovsky J, Moszkowicz JL (2011) The qualitative spatial dynamics of motion in language. Spatial Cognit Comput 11(1):15–44
Qing C, Goodman ND, Lassiter D (2016) A rational speech-act model of projective content. In: Proceedings of cognitive science, pp 1110–1115
Randell D, Cui Z, Cohn A (1992) A spatial logic based on regions and connection. In: Nebel B, Rich C, Swartout W (eds) KR'92. Principles of knowledge representation and reasoning: proceedings of the 3rd international conference. Morgan Kaufmann, San Mateo, pp 165–176
Roy D (2005) Semiotic schemas: a framework for grounding language in action and perception. Artif Intell 167(1–2):170–205
Schaffer S, Reithinger N (2019) Conversation is multimodal: thus conversational user interfaces should be as well. In: Proceedings of the 1st international conference on conversational user interfaces, pp 1–3
Scheutz M, Cantrell R, Schermerhorn P (2011) Toward humanlike task-based dialogue processing for human robot interaction. AI Mag 32(4):77–84
Schlenker P (2020) Gestural grammar. Nat Lang Linguist Theory pp 1–50
Shapiro L (2014) The Routledge handbook of embodied cognition. Routledge, England
Stalnaker R (2002) Common ground. Linguist Philos 25(5–6):701–721
Tavares JMRS, Padilha AJMN (1995) A new approach for merging edge line segments. In: Proceedings RecPad’95, Aveiro
Tellex S, Gopalan N, Kress-Gazit H, Matuszek C (2020) Robots that use language. Annu Rev Control Robot Auton Syst 3:25–55
Tomasello M, Carpenter M (2007) Shared intentionality. Dev Sci 10(1):121–125
Ullman TD, Goodman ND, Tenenbaum JB (2012) Theory learning as stochastic search in the language of thought. Cogn Dev 27(4):455–480
Unger C (2011) Dynamic semantics as monadic computation. In: JSAI international symposium on artificial intelligence, Springer, New York, pp 68–81
Van Benthem J (2011) Logical dynamics of information and interaction. Cambridge University Press, Cambridge
Van Ditmarsch H, van Der Hoek W, Kooi B (2007) Dynamic epistemic logic, vol 337. Springer, New York
Van Eijck J, Unger C (2010) Computational semantics with functional programming. Cambridge University Press, Cambridge
Vera AH, Simon HA (1993) Situated action: a symbolic interpretation. Cognit Sci 17(1):7–48. https://doi.org/10.1016/S0364-0213(05)80008-4
Wahlster W (2006) Dialogue systems go multimodal: the SmartKom experience. In: SmartKom: foundations of multimodal dialogue systems, Springer, New York, pp 3–27
Wang I, Narayana P, Patil D, Mulay G, Bangar R, Draper B, Beveridge R, Ruiz J (2017) EGGNOG: a continuous, multi-modal data set of naturally occurring gestures with ground truth labels. In: Proceedings of the 12th IEEE international conference on automatic face & gesture recognition
Weiser M (1999) The computer for the 21st century. ACM SIGMOBILE Mob Comput Commun Rev 3(3):3–11
Williams T, Bussing M, Cabrol S, Boyle E, Tran N (2019) Mixed reality deictic gesture for multi-modal robot communication. In: 2019 14th ACM/IEEE international conference on human–robot interaction (HRI), IEEE, pp 191–201
Winston ME, Chaffin R, Herrmann D (1987) A taxonomy of part-whole relations. Cognit Sci 11(4):417–444
Acknowledgements
We would like to thank Ross Beveridge, Bruce Draper, Francisco R. Ortega, and their team at Colorado State University, and Jaime Ruiz and his team at the University of Florida, without whose contribution the Diana System would not be a reality. We would also like to thank Katherine Krajovic, R. Pito Salas, and Nathaniel J. Dimick for their work on the Kirby implementation. Particular thanks to Ms. Krajovic for assembling the dialogue flowcharts in Fig. 14. We would also like to thank Ken Lai for his discussion regarding common ground structure. This work was supported by the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) under contract #W911NF-15-C-0238 at Brandeis University. This work was also supported in part by a grant to James Pustejovsky from the IIS Division of National Science Foundation (1763926) entitled “Building a Uniform Meaning Representation for Natural Language Processing”. The points of view expressed herein are solely those of the authors and do not represent the views of the Department of Defense or the United States Government. Any errors or omissions are, of course, the responsibility of the authors.
Additional information
This work was supported by the US Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) under contract #W911NF-15-C-0238 at Brandeis University. It was first presented in [79], on which this discussion is based.