
1 Introduction

The ability to learn behavior models from demonstrations is a powerful and much sought after technique in many applications such as robot navigation, autonomous driving, and robot manipulation, among others. Concretely, a robot’s decision making in such tasks is modeled using a Markov decision process (mdp). The reward function of the mdp is assumed to be the “representation of behavior”. However, it is often difficult to manually design reward functions that encode desired behaviors; hence inverse reinforcement learning (irl), formally introduced in [4], is commonly used to recover the reward function from human demonstrations of the task, as done in [1, 5, 8]. This is because it is often much easier to demonstrate a task than to rigorously specify all factors leading to the desired behavior. Bayesian inverse reinforcement learning (birl), introduced in [7], is particularly suited for such tasks, where a single reward function may not be sufficient and expert demonstrations are often sub-optimal.

However, when using birl in practical tasks, we are faced with many challenges, such as very large, continuous, and constrained state and action spaces, which make standard birl inference algorithms impractical. Constraints common in these tasks include the fact that often not all actions are executable on a robot. For example, conventional cars cannot drive sideways, and not all robots can turn on the spot. Thus, naïve representation of such mdps using basic function approximation techniques such as grid discretization quickly blows up in space and computational demands. Additionally, such discretization may discard good policies by limiting the actions available at a state, while still not accounting for the task constraints. We therefore develop a new graph-based representation that significantly reduces the size of the state space and encodes task-specific constraints directly into the action set of the mdp. Furthermore, standard birl inference algorithms, such as the policy walk (pw) algorithm of [7] based on Markov chain Monte Carlo (mcmc) or maximum a posteriori (map) approaches, often require iterating over all possible states and actions. This quickly becomes impractical when these spaces get very large, as in our case. We thus develop a novel extension of the birl algorithm by defining a new likelihood function which does not require iterating over all states and actions, but instead uses samples of trajectories over possibly infinite state and action spaces.

2 Method

Our behavior learning setup consists of two stages: first, a flexible data-driven mdp representation called a controller graph (cg), detailed in Sect. 2.1, and second, a reward learning step using sampled-trajectory birl.

Fig. 1. Conceptual illustration of a cg of a stochastic shortest path mdp with 7 states, s and g indicating the start and goal states respectively. The policy is shown with a double red line; reverse edges are shown with dotted blue lines.

2.1 Flexible MDP Representation

We use cgs to efficiently represent very large, possibly continuous mdps with an action set already constrained to the target domain, building upon [3, 6]. A cg, conceptually illustrated in Fig. 1, is a weighted labeled graph \(\mathcal {G}= \langle \mathcal {V}, \mathcal {E}, \mathbf {W} \rangle \) with a vertex set \(\mathcal {V} = \{ v_i \}\), an edge set \(\mathcal {E} = \{{(v_i, v_j)}_{a}\}\) and a transition matrix \(\mathbf {W}\), such that \(\mathcal {V} \subset \mathcal {S}\) and \(\mathcal {E} = \{{(v_i, v_j)}_a \mid w_{i,j} > 0, \, \forall v_i, v_j \in \mathcal {V}, a \in \mathcal {A} \}\), where \(\mathcal {S}\) and \({\mathcal {A}} \) are the state and action spaces respectively of the underlying mdp.
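As a concrete illustration, a cg could be held in memory roughly as in the following minimal Python sketch; the class and attribute names (Vertex, Edge, ControllerGraph, and so on) are our own illustrative choices, not part of the paper's implementation.

```python
# Illustrative sketch of a controller-graph (cg) container. A vertex stores a
# sampled state x_i; an edge stores the local controller that drives the
# system from vertex i to vertex j together with its transition weight w_ij.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

import numpy as np

State = np.ndarray
# A local controller maps (start state, target state) to a short trajectory x_{i:j}.
LocalController = Callable[[State, State], List[State]]


@dataclass
class Vertex:
    x: State                  # summarising state vector x_i
    value: float = 0.0        # value estimate used during graph construction


@dataclass
class Edge:
    source: int
    target: int
    controller: LocalController   # deterministic "macro action"
    weight: float = 0.0           # transition weight w_ij


@dataclass
class ControllerGraph:
    vertices: Dict[int, Vertex] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], Edge] = field(default_factory=dict)
```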

Therefore, vertices are state samples summarised by a vector \(\mathbf {x}_i\), and edges are short trajectories or “macro actions” \(\mathbf {x}_{i:j}\) between vertices i and j, which we call local controllers. These local controllers can be any deterministic controllers, such as motion primitives [2], and hence directly encode task constraints. The transition weights \(w_{i,j}\) can be estimated by simulating a local controller a number of times and setting the weight to the log ratio of successes in reaching the target vertex, as sketched below. The local controllers can also be interpreted as Markov options, in that once selected, a local controller completely defines the policy up to the next vertex. In practice, most robot control tasks already have fine-tuned controllers that are almost deterministic.
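One possible reading of this weight estimate is sketched below; the simulate callback, the trial count, and the success tolerance are assumptions made purely for illustration.

```python
# Sketch: estimate w_ij by executing the local controller from x_i towards x_j
# several times under simulated noise and taking the log ratio of successes.
import numpy as np


def estimate_weight(simulate, x_i, x_j, n_trials=20, tol=0.1, eps=1e-6):
    """simulate(x_i, x_j) -> final state reached by one noisy execution of the
    local controller; a trial succeeds if it ends within `tol` of the target."""
    successes = sum(
        np.linalg.norm(simulate(x_i, x_j) - x_j) < tol for _ in range(n_trials)
    )
    # log ratio of successful executions; eps avoids log(0) for hopeless edges
    return float(np.log((successes + eps) / n_trials))
```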

To build a cg, an empty graph is initialized using samples from the expert demonstrations, or alternatively uniform random samples from the state space. Additional vertex samples are then added iteratively by sampling around existing nodes, heuristically trading off exploration and exploitation. This trade-off is guided by the variance of the values of vertices in a local region, and by whether or not a vertex is part of the current iteration’s best policy. In practice, this leads to very few states, as shown in [5], where a \(10\,\text {m}^2\) 2D area can be effectively represented using under 150 vertices. A grid discretization of the same area with 10 cm resolution would already generate \(10^4\) states.
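The construction loop could look roughly as follows; the helper methods solve, local_value_variance, and add_vertex_and_connect are hypothetical placeholders for the planner, local value statistics, and edge-creation steps, and the mixing weights in the heuristic are arbitrary.

```python
# Rough sketch of iterative cg construction: sample new vertices around
# existing ones, favouring regions with high value variance (exploration)
# and vertices on the current best policy (exploitation).
import numpy as np


def grow_graph(cg, n_iterations=50, samples_per_vertex=3, radius=0.5, rng=None):
    rng = rng or np.random.default_rng()
    for _ in range(n_iterations):
        best_policy = cg.solve()                    # hypothetical: vertex ids on the current best policy
        for v_id, vertex in list(cg.vertices.items()):
            local_var = cg.local_value_variance(v_id, radius)   # hypothetical helper
            # heuristic exploration/exploitation trade-off (weights are arbitrary)
            p_expand = 0.5 * local_var / (1.0 + local_var) + 0.5 * (v_id in best_policy)
            if rng.random() < p_expand:
                for _ in range(samples_per_vertex):
                    x_new = vertex.x + rng.normal(scale=radius, size=vertex.x.shape)
                    cg.add_vertex_and_connect(x_new)  # hypothetical helper
    return cg
```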

2.2 BIRL Using Sampled Trajectories

Building upon Ng and Russell [4], we develop an iterative birl algorithm that uses trajectories randomly sampled from cgs to recover reward functions in very large (possibly infinite) spaces. We define a new likelihood function for birl that uses these sampled trajectories, as shown in (1).

$$\begin{aligned} \Pr (\mathrm {\Xi } \mid R) = \prod _{\xi ^e \in \mathrm {\Xi }} \left( \frac{\exp \left( \beta \zeta (\xi ^e, R) \right) }{ \exp \left( \beta \zeta (\xi ^e, R) \right) + \sum ^k_{i=1} \exp \left( \beta \zeta (\xi _i^g, R) \right) } \right) \end{aligned}$$
(1)

where \(\mathrm {\Xi }\) is the set of expert demonstrations, each being a trajectory of state-action pairs, \(\zeta (\xi , R) = \sum _{(s, a) \in \xi } Q^{\pi }(s, a)\), with the policy \(\pi \) obtained using reward R. \(\xi _i^g\) is a trajectory sampled using the candidate policy at iteration i, k is the current iteration, and \(\beta \) is our confidence that the expert takes optimal actions when performing the demonstrations. Therefore, as the reward function improves, we are able to generate sample trajectories of increasing similarity to the expert’s. This new likelihood function is related to the original one of [7] when each trajectory is interpreted as a single action. The prior remains unchanged, as given in [7]. The posterior is given by Bayes’ rule as \(\Pr (R \mid \mathrm {\Xi }) = 1/\eta \, \Pr (\mathrm {\Xi } \mid R) \Pr (R)\), with \(\eta = \int \Pr (\mathrm {\Xi } \mid R) \Pr (R)\, dR\). To infer the reward posterior distribution, the same pw algorithm of [7] can be employed; alternatively, map estimates also yield good results, as we found experimentally. Once the reward function is found, it can be used to generate costmaps for motion planning or embedded directly in the objective functions of planning algorithms. In our case, we additionally assume, in line with [1, 8], that the reward function is a linear combination of features of the state and action spaces, and then infer the feature weights.
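As an illustration, the likelihood in (1) and the linear reward assumption could be evaluated as in the sketch below; the names zeta, log_likelihood, reward, and the q_value and phi callbacks are illustrative, and the Q-function is assumed to come from solving the cg mdp under the current candidate reward.

```python
# Sketch of the sampled-trajectory likelihood of Eq. (1) and a linear reward.
# All helper names are illustrative; q_value(s, a) is assumed to return the
# Q-value under the policy computed for the current candidate reward R.
import numpy as np


def reward(w, phi, s, a):
    """Linear reward assumption: R(s, a) = w . phi(s, a)."""
    return float(np.dot(w, phi(s, a)))


def zeta(trajectory, q_value):
    """zeta(xi, R): sum of Q-values along a trajectory of (state, action) pairs."""
    return sum(q_value(s, a) for s, a in trajectory)


def log_likelihood(expert_trajs, generated_trajs, q_value, beta=0.9):
    """log Pr(Xi | R) as in Eq. (1); generated_trajs are the trajectories
    xi_i^g sampled from the candidate policies of iterations 1..k (k >= 1)."""
    gen_scores = np.array([beta * zeta(xi_g, q_value) for xi_g in generated_trajs])
    log_gen_sum = np.logaddexp.reduce(gen_scores)   # log sum_i exp(beta * zeta(xi_i^g))
    ll = 0.0
    for xi_e in expert_trajs:
        e_score = beta * zeta(xi_e, q_value)
        # log of exp(e_score) / (exp(e_score) + sum_i exp(g_i))
        ll += e_score - np.logaddexp(e_score, log_gen_sum)
    return ll
```

A map estimate of the feature weights can then, for example, be obtained by perturbing the weights, re-solving for the policy on the cg, and keeping proposals that increase the unnormalised log-posterior \(\log \Pr (\mathrm {\Xi } \mid R) + \log \Pr (R)\); this mirrors the pw-style random walk mentioned above but is only one possible realisation.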

3 Experiments and Results

We conducted extensive experiments in simulation and on a real robot to demonstrate that the setup can indeed learn many complex navigation behaviors with practical constraints. We were able to learn five behaviors useful for robot navigation in a busy airport scenario: polite, sociable, and rude navigation behaviors, and additionally, merging with flows and slipstream navigation. The behaviors were evaluated using objective and subjective metrics to assess potential trade-offs between normativeness and functionality. As shown in [5], we found that it is possible to have normative behavior without sacrificing functionality.

4 Conclusions

We have presented an approach that takes irl algorithms developed in the machine learning literature and develops compatible yet practical extensions for application in real-world robotics. This endeavor highlights the key challenges that need to be addressed to achieve more generalizable approaches. In future work, we aim to establish formal performance bounds for the new algorithm.