
1 Introduction

The ability to learn behavior models from demonstrations is a powerful and much sought after technique in many applications such as robot navigation, autonomous driving, and robot manipulation, among others. Concretely, a robot’s decision making in such tasks is modeled using a Markov decision process (mdp). The reward function of the mdp is assumed to be the “representation of behavior”. However, it is often difficult to manually design reward functions that encode desired behaviors; hence inverse reinforcement learning (irl), formally introduced in [4], is commonly used to recover the reward function from human demonstrations of the task, as done in [1, 5, 8]. This is because it is often much easier to demonstrate a task than to rigorously specify all factors leading to the desired behavior. Bayesian inverse reinforcement learning (birl), introduced in [7], is particularly suited for such tasks, where a single reward function may not be sufficient and expert demonstrations are often sub-optimal.

However, when using birl in practical tasks, we are faced with many challenges, such as very large, continuous, and constrained state and action spaces, which make standard birl inference algorithms impractical. Constraints common in these tasks include the fact that often not all actions are executable on a robot. For example, conventional cars cannot drive sideways, and not all robots can turn on the spot. Thus, naïve representation of such mdps using basic function approximation techniques such as grid discretization quickly blows up in space and computational demands. Additionally, such discretization may discard good policies by limiting the actions available at a state, while still not accounting for the task constraints. We therefore develop a new graph-based representation that significantly reduces the size of the state space and encodes task-specific constraints directly into the action set of the mdp. Furthermore, standard birl inference algorithms, such as the policy walk (pw) algorithm of [7] based on Markov chain Monte Carlo (mcmc) or maximum a posteriori (map) approaches, often require iterating over all possible states and actions. This quickly becomes impractical when these spaces get very large, as in our case. We thus develop a novel extension of the birl algorithm by defining a new likelihood function which does not require iterating over all states and actions, but instead uses samples of trajectories over possibly infinite state and action spaces.

2 Method

Our behavior learning setup consists of two stages: first, a flexible data-driven mdp representation called a controller graph (cg), detailed in Sect. 2.1, and second, a reward learning step using sampled-trajectory birl.

Fig. 1. Conceptual illustration of a cg of a stochastic shortest path mdp with 7 states, s and g indicating the start and goal states respectively. The policy is shown with a double red line; reverse edges are shown with dotted blue lines.

2.1 Flexible MDP Representation

We use cgs to efficiently represent very large, possibly continuous mdps with an action set already constrained to the target domain, building upon [3, 6]. A cg, conceptually illustrated in Fig. 1, is a weighted labeled graph \(\mathcal {G}= \langle \mathcal {V}, \mathcal {E}, \mathbf {W} \rangle \) with a vertex set \(\mathcal {V} = \{ v_i \}\), an edge set \(\mathcal {E} = \{{(v_i, v_j)}_{a}\}\) and a transition matrix \(\mathbf {W}\), such that \(\mathcal {V} \subset \mathcal {S}\) and \(\mathcal {E} = \{{(v_i, v_j)}_a \mid w_{i,j} > 0, \, \forall v_i, v_j \in \mathcal {V}, a \in \mathcal {A} \}\), where \(\mathcal {S}\) and \({\mathcal {A}} \) are the state and action spaces respectively of the underlying mdp.
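As a concrete illustration, a cg could be held in memory roughly as in the following minimal Python sketch; the class and attribute names (Vertex, Edge, ControllerGraph, and so on) are our own illustrative choices, not part of the paper's implementation.

```python
# Illustrative sketch of a controller-graph (cg) container. A vertex stores a
# sampled state x_i; an edge stores the local controller that drives the
# system from vertex i to vertex j together with its transition weight w_ij.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

import numpy as np

State = np.ndarray
# A local controller maps (start state, target state) to a short trajectory x_{i:j}.
LocalController = Callable[[State, State], List[State]]


@dataclass
class Vertex:
    x: State                  # summarising state vector x_i
    value: float = 0.0        # value estimate used during graph construction


@dataclass
class Edge:
    source: int
    target: int
    controller: LocalController   # deterministic "macro action"
    weight: float = 0.0           # transition weight w_ij


@dataclass
class ControllerGraph:
    vertices: Dict[int, Vertex] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], Edge] = field(default_factory=dict)
```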

Therefore, vertices are state samples summarised by a vector \(\mathbf {x}_i\), and edges are short trajectories or “macro actions” \(\mathbf {x}_{i:j}\) between vertices i and j, which we call local controllers. These local controllers can be any deterministic controllers, such as motion primitives [2], and hence directly encode task constraints. The transition weights \(w_{i,j}\) can be estimated by simulating a local controller a number of times and setting the weight to the log ratio of successes in reaching the target vertex, as sketched below. The local controllers can also be interpreted as Markov options, in that once selected, a local controller completely defines the policy up to the next vertex. In practice, most robot control tasks already have fine-tuned controllers that are almost deterministic.
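One possible reading of this weight estimate is sketched below; the simulate callback, the trial count, and the success tolerance are assumptions made purely for illustration.

```python
# Sketch: estimate w_ij by executing the local controller from x_i towards x_j
# several times under simulated noise and taking the log ratio of successes.
import numpy as np


def estimate_weight(simulate, x_i, x_j, n_trials=20, tol=0.1, eps=1e-6):
    """simulate(x_i, x_j) -> final state reached by one noisy execution of the
    local controller; a trial succeeds if it ends within `tol` of the target."""
    successes = sum(
        np.linalg.norm(simulate(x_i, x_j) - x_j) < tol for _ in range(n_trials)
    )
    # log ratio of successful executions; eps avoids log(0) for hopeless edges
    return float(np.log((successes + eps) / n_trials))
```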

To build a cg, an empty graph is initialized using samples from the expert demonstrations, or alternatively uniform random samples from the state space. Additional vertex samples are then added iteratively by sampling around existing nodes, heuristically trading off exploration and exploitation. This trade-off is guided by the variance of the values of vertices in a local region, and by whether or not a vertex is part of the current iteration’s best policy. In practice, this leads to very few states, as shown in [5], where a \(10\,\text {m}^2\) 2D area can be effectively represented using under 150 vertices. A grid discretization of the same area with 10 cm resolution would already generate \(10^4\) states.
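The construction loop could look roughly as follows; the helper methods solve, local_value_variance, and add_vertex_and_connect are hypothetical placeholders for the planner, local value statistics, and edge-creation steps, and the mixing weights in the heuristic are arbitrary.

```python
# Rough sketch of iterative cg construction: sample new vertices around
# existing ones, favouring regions with high value variance (exploration)
# and vertices on the current best policy (exploitation).
import numpy as np


def grow_graph(cg, n_iterations=50, samples_per_vertex=3, radius=0.5, rng=None):
    rng = rng or np.random.default_rng()
    for _ in range(n_iterations):
        best_policy = cg.solve()                    # hypothetical: vertex ids on the current best policy
        for v_id, vertex in list(cg.vertices.items()):
            local_var = cg.local_value_variance(v_id, radius)   # hypothetical helper
            # heuristic exploration/exploitation trade-off (weights are arbitrary)
            p_expand = 0.5 * local_var / (1.0 + local_var) + 0.5 * (v_id in best_policy)
            if rng.random() < p_expand:
                for _ in range(samples_per_vertex):
                    x_new = vertex.x + rng.normal(scale=radius, size=vertex.x.shape)
                    cg.add_vertex_and_connect(x_new)  # hypothetical helper
    return cg
```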

2.2 BIRL Using Sampled Trajectories

Building upon Ng and Russell [4], we develop an iterative birl algorithm that uses trajectories randomly sampled from cgs to recover reward functions in very large (possibly infinite) spaces. We define a new likelihood function for birl that uses these sampled trajectories, as shown in (1).

$$\begin{aligned} \Pr (\mathrm {\Xi } \mid R) = \prod _{\xi ^e \in \mathrm {\Xi }} \left( \frac{\exp \left( \beta \zeta (\xi ^e, R) \right) }{ \exp \left( \beta \zeta (\xi ^e, R) \right) + \sum ^k_{i=1} \exp \left( \beta \zeta (\xi _i^g, R) \right) } \right) \end{aligned}$$
(1)

where \(\mathrm {\Xi }\) is the set of expert demonstrations, each being a trajectory of state-action pairs, \(\zeta (\xi , R) = \sum _{(s, a) \in \xi } Q^{\pi }(s, a)\), with the policy \(\pi \) obtained using reward R. \(\xi _i^g\) is a trajectory sampled using the candidate policy at iteration i, k is the current iteration, and \(\beta \) is our confidence that the expert takes optimal actions when performing the demonstrations. Therefore, as the reward function improves, we are able to generate sample trajectories of increasing similarity to the expert’s. This new likelihood function is related to the original one of [7] when each trajectory is interpreted as a single action. The prior remains unchanged, as given in [7]. The posterior is given by Bayes’ rule as \(\Pr (R \mid \mathrm {\Xi }) = 1/\eta \, \Pr (\mathrm {\Xi } \mid R) \Pr (R)\), with \(\eta = \int \Pr (\mathrm {\Xi } \mid R) \Pr (R)\, dR\). To infer the reward posterior distribution, the same pw algorithm of [7] can be employed; alternatively, map estimates also yield good results, as we found experimentally. Once the reward function is found, it can be used to generate costmaps for motion planning or embedded directly in the objective functions of planning algorithms. In our case, we additionally assume, in line with [1, 8], that the reward function is a linear combination of features of the state and action spaces, and then infer the feature weights.
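As an illustration, the likelihood in (1) and the linear reward assumption could be evaluated as in the sketch below; the names zeta, log_likelihood, reward, and the q_value and phi callbacks are illustrative, and the Q-function is assumed to come from solving the cg mdp under the current candidate reward.

```python
# Sketch of the sampled-trajectory likelihood of Eq. (1) and a linear reward.
# All helper names are illustrative; q_value(s, a) is assumed to return the
# Q-value under the policy computed for the current candidate reward R.
import numpy as np


def reward(w, phi, s, a):
    """Linear reward assumption: R(s, a) = w . phi(s, a)."""
    return float(np.dot(w, phi(s, a)))


def zeta(trajectory, q_value):
    """zeta(xi, R): sum of Q-values along a trajectory of (state, action) pairs."""
    return sum(q_value(s, a) for s, a in trajectory)


def log_likelihood(expert_trajs, generated_trajs, q_value, beta=0.9):
    """log Pr(Xi | R) as in Eq. (1); generated_trajs are the trajectories
    xi_i^g sampled from the candidate policies of iterations 1..k (k >= 1)."""
    gen_scores = np.array([beta * zeta(xi_g, q_value) for xi_g in generated_trajs])
    log_gen_sum = np.logaddexp.reduce(gen_scores)   # log sum_i exp(beta * zeta(xi_i^g))
    ll = 0.0
    for xi_e in expert_trajs:
        e_score = beta * zeta(xi_e, q_value)
        # log of exp(e_score) / (exp(e_score) + sum_i exp(g_i))
        ll += e_score - np.logaddexp(e_score, log_gen_sum)
    return ll
```

A map estimate of the feature weights can then, for example, be obtained by perturbing the weights, re-solving for the policy on the cg, and keeping proposals that increase the unnormalised log-posterior \(\log \Pr (\mathrm {\Xi } \mid R) + \log \Pr (R)\); this mirrors the pw-style random walk mentioned above but is only one possible realisation.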

3 Experiments and Results

We conducted extensive experiments in simulation and on a real robot to demonstrate that the setup can indeed learn many complex navigation behaviors with practical constraints. We were able to learn five behaviors useful for robot navigation in a busy airport scenario: polite, sociable, and rude navigation behaviors, and additionally, merging with flows and slipstream navigation. The behaviors were evaluated using objective and subjective metrics to assess potential trade-offs between normativeness and functionality. As shown in [5], we found that it is possible to have normative behavior without sacrificing functionality.

4 Conclusions

We have presented an approach that takes irl algorithms developed in the machine learning literature and develops compatible yet practical extensions for application in real-world robotics. This endeavor highlights the key challenges that need to be addressed to achieve more generalizable approaches. In future work, we aim to establish formal performance bounds for the new algorithm.