Abstract
This article presents the results of a study on the effects of representing time and intention in models of socially situated tasks on the quality of policies for robot behavior. The ability to reason about how others’ observable actions relate to their unobservable intentions is an important part of interpreting and responding to social behavior. It is also often necessary to observe the timing of actions in order to disambiguate others’ intentions. Therefore, our proposed approach is to model these interactions as time-indexed partially observable Markov decision processes (POMDPs). The intentions of humans are represented as hidden state in the POMDP models, and the time-dependence of actions by both humans and the robot is explicitly modelled. Our hypothesis is that planning for these interactions with a model that represents time-dependent action outcomes and uncertainty about others’ intentions will achieve better results than simpler models that make fixed assumptions about people’s intentionality or abstract away time-dependent effects. A driving interaction governed by social conventions and involving ambiguity in the other driver’s intent was used as the scenario with which to test this hypothesis. A robot car controlled by policies from time-dependent POMDP models or by policies from two less expressive model variants performed this interaction in a driving simulator with human drivers. The time-dependent POMDP policies achieved better results than those of the models without explicit time representation or human intention as hidden state, both according to the reward obtained and to people’s subjective impressions of how socially appropriate and natural the robot’s behavior was. These results demonstrate both the relative superiority of these representation choices and the effectiveness of this approach to planning for socially situated tasks.
Appendix: Model Description
The overall structure of the domain description file is given in Table 16. The state space is time-indexed, so the number of timesteps in an episode must be specified. The time index is represented by a special state variable named “Time” that always has the value of the current timestep. This variable may be referred to in the preconditions of action or reward rules but may not have its value changed by an action rule. The rest of the state space is defined by a list of state variable names with the corresponding range of integer values that each variable may take.
The action space is defined in the same manner. Additionally, three special action variables are defined: “Human,” “Env,” and “Side Effects.” These are used to specify the action rules that describe the actions of the human agent and the environment. Because these actions are taken automatically in response to the actions of the robot at each timestep rather than being chosen by the robot, their ranges of action values are left undefined.
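To make this structure concrete, the sketch below mirrors the domain description in Python. This is a hypothetical rendering for illustration only; the names (`DomainDescription`, `add_state_var`, etc.) are assumptions and do not reflect the actual syntax of the description file in Table 16.

```python
from dataclasses import dataclass, field

@dataclass
class DomainDescription:
    """Illustrative in-code mirror of the domain description file's structure."""
    num_timesteps: int                                          # episode length; drives the "Time" variable
    state_vars: dict[str, int] = field(default_factory=dict)    # variable name -> number of integer values
    action_vars: dict[str, int] = field(default_factory=dict)   # robot-chosen actions and their value ranges
    # Special actions applied automatically each timestep rather than chosen
    # by the robot; their value ranges are left undefined.
    special_actions: tuple[str, ...] = ("Human", "Env", "Side Effects")

    def add_state_var(self, name: str, num_values: int) -> None:
        # "Time" is reserved for the time index and cannot be redefined.
        assert name != "Time", '"Time" is a reserved state variable'
        self.state_vars[name] = num_values

# A 60-timestep episode with two of the state variables from Sect. A.1:
domain = DomainDescription(num_timesteps=60)
domain.add_state_var("human_intention", 2)
domain.add_state_var("robot_position", 5)
```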
A.1 States and Observations
The state space for the POMDP models is defined as follows:
- human intention (2)—go first, yield
- robot position (5)—the region occupied by the robot
- human position (5)—the region occupied by the human
- robot angle (2)—whether car body is turned more than 45 degrees
- robot speed (3)—stopped, rolling/creeping, fast
- human speed (3)—stopped, rolling/creeping, fast
- human headlights (2)—whether the human has flashed their headlights
- collision (3)—no collision, collision at this timestep, collision has occurred
- traffic light (3)—the color of the traffic light
- time (60)—the current timestep
The intersection and the surrounding roads are partitioned into rectangular regions relative to each agent’s starting position that are significant to the interaction. These regions are: more than a car length from the start of the intersection, a car length from the start of the intersection, the half of the intersection closest to its start, the half of the intersection closest to its end, and the road on the other side of the intersection. There is an additional position value that corresponds to the event of an agent entering the region of the road beyond the intersection for the first timestep. This value is used to assign the one-time reward for an agent crossing the intersection.
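A minimal sketch of this partition, assuming progress is measured in car lengths from the start of the intersection; the function name and the threshold values are assumptions made for illustration, not taken from the model files.

```python
def position_region(progress: float, crossed_before: bool) -> int:
    """Map an agent's longitudinal progress (in car lengths from the start of
    the intersection) to one of the six position values described above.
    Thresholds are illustrative assumptions."""
    if progress < -1.0:
        return 0  # more than a car length from the start of the intersection
    if progress < 0.0:
        return 1  # within a car length of the start of the intersection
    if progress < 0.5:
        return 2  # half of the intersection closest to its start
    if progress < 1.0:
        return 3  # half of the intersection closest to its end
    # Beyond the intersection: the first timestep there is a distinct value,
    # used to assign the one-time reward for crossing.
    return 4 if crossed_before else 5
```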
Observations are created by defining a subset of the state variables that are directly observable. The observable state variables for this domain are as follows:
- robot position
- human position
- human headlights
- human speed
- collision
- traffic light
Because the robot’s current speed and angle should always be known based on the effects of the set speed and turn actions, these variables are not included in the observation space.
A.2 Actions and Action Rules
Four actions are available to the robot to control its motion. These are high-level actions that must be converted into control commands for the car in the driving simulator by a low-level controller.
- set speed (3)—stop, go slow, go fast
- turn (1)—turn car body to the left
It is assumed that these actions can produce their immediate effect as a change in state within the timestep they are applied. For example, if the car is going fast and the robot chooses the stop action, the car will be stopped by the next timestep. Other action effects, such as whether moving at a certain speed will cause the robot to move into the next region, are probabilistic.
Further probabilistic action outcomes are created by sequentially applying the action rules for the special action variables to the states resulting from the robot’s action rules. The action rules for the “Human” action describe the possible behavior of the human. The action rules for the “Env” action have preconditions based on the time variable and control the changing of the traffic light. The action rules for the “Side Effects” action detect combinations of state variables indicating that the robot and human have driven into the same region in a way that may result in a collision.
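The per-timestep ordering described above can be sketched as follows. This is an illustrative reconstruction, not the planner's actual code; `apply_rules` is a hypothetical placeholder for whatever applies the matching action rules to a state.

```python
def step(state, robot_action, apply_rules):
    """Apply one timestep's transitions in the order described above:
    the robot's chosen action first, then the special action variables
    are applied sequentially to the resulting state.
    apply_rules(state, action_var) -> new state (illustrative signature)."""
    state = apply_rules(state, robot_action)
    for special in ("Human", "Env", "Side Effects"):
        state = apply_rules(state, special)
    return state
```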
Action rules are defined by the preconditions that determine whether or not the rule is applicable in a state, the possible effects of that action on the state variable values, and a weighting factor that specifies the relative likelihood of those outcomes. This action description language provides a simple and flexible way to express the state transition structure in terms of small subsets of relevant variables and their values. For each action rule, the action variable and variable value that the rule describes is specified. In order to uniquely identify each action rule, an additional rule name is also given. This is necessary because the same combination of action variable and value may have different effects and different weights depending on the preconditions. Multiple action rules may be used to specify the full range of possible outcomes and their relative weights for a single action. The possible changes to state variables caused by an action rule are specified as a list of action effects. Each action effect may affect multiple state variables simultaneously. All of the effects for an action rule are given equal weight by default. The weight, if specified, is an integer that defines a ratio of how much less likely the action effects listed are than the default. For example, a weight of 2 would mean that a rule’s action effects are half as likely.
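The weight convention above can be made concrete with a short sketch. Here each candidate outcome is treated as one weighted effect (a simplification of the rule-level weighting in the text, made for illustration): an unweighted outcome gets the default weight 1, and a weight w makes an outcome 1/w as likely as the default before normalization.

```python
from fractions import Fraction

def outcome_probabilities(effects_with_weights):
    """effects_with_weights: list of (effect, weight) pairs, weight=1 being
    the default. A weight w makes that effect 1/w as likely as a default
    effect; the results are normalized into a probability distribution."""
    raw = [(effect, Fraction(1, weight)) for effect, weight in effects_with_weights]
    total = sum(p for _, p in raw)
    return [(effect, p / total) for effect, p in raw]

# Two default-weight outcomes and one with weight 2 (half as likely):
probs = outcome_probabilities([("stay", 1), ("advance", 1), ("skid", 2)])
```

With these illustrative outcomes, "stay" and "advance" each get probability 2/5 and the weight-2 "skid" gets 1/5, preserving the stated 2:2:1 likelihood ratio.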
An example action rule describing the behavior of the human agent is shown in Table 17. The special “Human” action variable specifies that this action rule describes the behavior of the uncontrollable human agent. Because the robot cannot choose the actions taken by this other agent, the action value is 0, with “flash_no_yield” providing a unique identifier for this action rule. This action rule has a single possible effect, to turn on the headlights of the human’s car. The condition list describes the state variable values that determine when this action rule may take effect. It only makes sense to apply this action in states where the headlights are off. The human may turn on the lights at any time before they move into the intersection. The goal condition restricts this rule to cases where the human has the intention not to yield to the turning driver. The velocity 0 condition reflects the fact that people were only observed to turn on their headlights when their car was not moving. This action effect is \(\frac{1}{100}\) as likely to occur as a default outcome because a car that doesn’t intend to yield flashing its headlights is a rare event.
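For readers without access to Table 17, the rule just described might be rendered as the following hypothetical structure. The field names and encodings (e.g. 0 for "go first") are assumptions made for this sketch and are not the actual file syntax.

```python
# The "flash_no_yield" rule described above, as an illustrative structure.
flash_no_yield = {
    "action_var": "Human",      # special variable: the uncontrollable human agent
    "action_value": 0,          # not chosen by the robot, so the value is 0
    "rule_name": "flash_no_yield",
    "preconditions": {
        "human_headlights": 0,  # lights must currently be off
        "human_intention": 0,   # assumed encoding: 0 = "go first" (will not yield)
        "human_speed": 0,       # people only flashed lights while stopped
        # (a position precondition keeping the human out of the
        #  intersection is omitted from this sketch)
    },
    "effects": [{"human_headlights": 1}],  # single effect: turn the headlights on
    "weight": 100,              # 1/100 as likely as a default outcome
}
```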
Broz, F., Nourbakhsh, I. & Simmons, R. Planning for Human–Robot Interaction in Socially Situated Tasks. Int J of Soc Robotics 5, 193–214 (2013). https://doi.org/10.1007/s12369-013-0185-z