Parametric POMDPs for planning in continuous state spaces
Introduction
A Markov decision process (MDP) models the repeated interaction of an agent with a stochastic environment [1]. MDP-based approaches to planning are well studied and effective in domains where perfect knowledge of the state of the world is available. Unfortunately, they are less effective in problems where the state is uncertain, a condition that prevails in many real-world problems.
When the state is unknown but some (uncertain) information about the state is available through observations, the world can be described by a partially observable Markov decision process (POMDP) [2]. A POMDP model defines a probabilistic representation of an agent’s world. Specifically, given an initial state and action, it defines probability distributions over possible resultant states and observations. Given a reward function, an agent’s task is to select actions which maximise its expected sum of (possibly discounted) future rewards.
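As a concrete discrete-state illustration of these ingredients, the sketch below bundles the transition, observation and reward models into a single generative object. The class name, the tabular representation and the model values are illustrative, not taken from the paper:

```python
import numpy as np

# A minimal discrete POMDP specification (illustrative, not the paper's):
# T[a][s, s'] gives transition probabilities, O[a][s', z] gives observation
# probabilities from the *resultant* state, and R[s, a] gives rewards.
class POMDP:
    def __init__(self, T, O, R, gamma=0.95):
        self.T, self.O, self.R, self.gamma = T, O, R, gamma

    def step(self, s, a, rng):
        """Sample one interaction: resultant state, observation and reward."""
        s_next = rng.choice(len(self.T[a][s]), p=self.T[a][s])
        z = rng.choice(len(self.O[a][s_next]), p=self.O[a][s_next])
        return s_next, z, self.R[s, a]
```

The agent's task is then to choose actions maximising the expected discounted sum of the rewards this generative process emits.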
The POMDP task is challenging because the agent must consider both the history of all previous observations and actions, and the space of all possible future observations and actions. The task is simplified by the fact that, given knowledge of the POMDP model, the agent can maintain a probability distribution over states which summarises the entire history [3]. This distribution is usually referred to as the agent’s belief. Maintaining a consistent belief allows the problem to be converted from a POMDP over partially observable states to an MDP over fully observable beliefs. Traditional MDP solution methods can then be applied to the resultant belief-state MDP [3], [1].
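The belief maintenance described above is a Bayes filter: predict the state distribution through the transition model, then correct it with the observation likelihood. A minimal discrete-state sketch (the array layout is an assumption for illustration):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayes-filter update of a discrete belief vector b after taking
    action a and receiving observation z:
        b'(s') ∝ O[a][s', z] * sum_s T[a][s, s'] * b(s)."""
    predicted = b @ T[a]                 # prediction through the dynamics
    posterior = predicted * O[a][:, z]   # correction by observation likelihood
    return posterior / posterior.sum()   # renormalise to a distribution
```

Because the updated belief depends only on the previous belief, the action and the observation, the belief is a sufficient summary of the entire history.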
A number of POMDP solution methods, including the one proposed in this paper, solve the resultant MDP using value iteration. Essentially, value iteration iteratively builds a value function which specifies the expected sum of discounted future rewards attainable from each belief-state. Given such a value function, an agent can act by simply choosing, at each step, the action which maximises its value; because the value function already accounts for future rewards, this greedy one-step choice is equivalent to planning ahead.
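The greedy choice can be sketched as a one-step lookahead over successor beliefs. The interface below (a value function V callable on belief vectors, tabular models T, O, R) is illustrative, not the paper's:

```python
import numpy as np

def greedy_action(b, V, actions, T, O, R, gamma=0.95):
    """Pick the action maximising expected immediate reward plus the
    discounted value of the successor belief, averaged over observations.
    V maps a belief vector to a scalar value estimate (illustrative)."""
    def q(a):
        q_val = b @ R[:, a]                    # expected immediate reward
        predicted = b @ T[a]                   # predicted state distribution
        for z in range(O[a].shape[1]):
            p_z = predicted @ O[a][:, z]       # probability of observing z
            if p_z > 0:
                b_next = predicted * O[a][:, z] / p_z
                q_val += gamma * p_z * V(b_next)
        return q_val
    return max(actions, key=q)
```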
The problem of robot navigation is often cast as a POMDP, on the grounds that localisation is inherently imperfect and MDP-based approaches do not account for this uncertainty. The POMDP solution explicitly models the robot's position uncertainty, making decisions based on the probability distribution over pose space. This naturally imparts the useful property that the robot will trade off actions that move it towards its goal against actions that reduce its uncertainty, in a principled way.
The majority of value-based POMDP research for robot navigation has focussed on the discrete case, dividing configuration spaces into finite numbers of cells. Robot navigation, however, is a fundamentally continuous problem that is poorly represented in the discrete domain unless the discretisation is sufficiently fine. Discrete POMDP solution methods have problems with fine discretisations because the dimensionality of the belief space is equal to the number of states, and computational complexity increases rapidly with the dimensionality of the belief space.
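To make the scaling concrete, a back-of-the-envelope calculation (world and grid sizes are illustrative):

```python
# Belief-space dimensionality of a discretised 2-D navigation problem:
# with one belief dimension per grid cell, even a modest world yields a
# very high-dimensional belief space. Numbers below are illustrative.
def belief_dimension(width_m, height_m, resolution_m):
    cells_x = round(width_m / resolution_m)
    cells_y = round(height_m / resolution_m)
    return cells_x * cells_y

# A 10 m x 10 m world at 0.1 m resolution already gives a
# 10,000-dimensional belief space.
```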
This paper presents an approach to solving robot-navigation POMDP problems efficiently in continuous state spaces. We refer to this approach as a parametric POMDP solution method [4]. By constraining distributions over state space to a parametric family, points in the infinite-dimensional continuous belief space can be represented by finite vectors of sufficient statistics. Choosing a parametric family with a relatively small number of sufficient statistics results in a relatively low-dimensional belief space. For a given combination of MDP model and parametric form, it may be possible to find a (possibly approximate) belief transition function which preserves that parametric form. If so, belief updates can be performed efficiently, directly in the low-dimensional parameter-space. Since the value function is not likely to be piecewise-linear and convex (PWLC) in sufficient-statistic space, fitted value iteration [5] is used to solve the POMDP. For reasons described in Section 3, we focus on the use of Gaussian distributions as a parametric form.
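A minimal sketch of fitted value iteration over a two-parameter Gaussian belief space (mean and standard deviation), assuming an illustrative quadratic feature basis and an externally supplied Bellman backup; this is a sketch of the general technique, not the paper's implementation:

```python
import numpy as np

def fitted_value_iteration(backup, samples, n_iters=50):
    """Fitted value iteration in parameter space.
    samples: belief parameters, shape (N, 2) = (mu, sigma) per row.
    backup(theta, V): the Bellman-backed-up value at theta, where V maps
    a parameter vector to a value estimate. Both are illustrative."""
    feats = lambda TH: np.c_[np.ones(len(TH)), TH, TH**2]  # quadratic basis
    w = np.zeros(feats(samples).shape[1])
    for _ in range(n_iters):
        V = lambda th: feats(np.atleast_2d(th)) @ w
        targets = np.array([float(backup(th, V)) for th in samples])
        # a least-squares fit (an averager in spirit) keeps the
        # approximation stable under repeated backups
        w, *_ = np.linalg.lstsq(feats(samples), targets, rcond=None)
    return w
```

With a backup that, say, penalises uncertainty at each step, the fitted weights converge towards the discounted fixed point of that backup on the sampled parameter points.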
The remainder of this paper is organised as follows. Section 2 discusses related approaches, and Section 3 formulates the dynamic programming equations on which the POMDP solution is founded and discusses the implications of a parametric representation. Section 4 describes a solution using this representation, Section 5 applies this solution to a robot navigation problem and Section 6 concludes and provides directions for future work.
Related work
This section provides a brief review of prior work on POMDP solution methods. For a more thorough review, readers are directed to [6] and the references therein.
Formulation of the POMDP
The objective in a POMDP problem is to calculate a policy which optimises a discounted sum of future rewards in a stationary, partially observable environment with known dynamics. This section begins by establishing some terminology.
At each discrete time interval t, a POMDP agent is in an unknown state x_t. The agent chooses an action a_t, receives a reward r_t and arrives in state x_{t+1}. It then receives an observation z_{t+1} from the new state.
It is useful to define an information vector comprising all actions taken and observations received up to the current time.
Solving parametric POMDPs
Having chosen a parametric form, the continuous distribution over state space can be written in terms of a vector of sufficient statistics θ. Maintaining a consistent belief in this parametric form requires the specification of an initial belief state θ_0, plus an update function mapping (θ_t, a_t, z_{t+1}) to θ_{t+1}. The instantaneous reward function and the distribution over subsequent observations are also functions of the sufficient statistics, defined in terms of integrals over the state
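For linear-Gaussian dynamics and observation models, the Kalman filter provides exactly such a family-preserving analytic update of the sufficient statistics (mean μ and covariance P); the model matrices below are illustrative, not taken from the paper:

```python
import numpy as np

# Sketch of a belief update carried out directly in parameter space:
# with a Gaussian belief and linear-Gaussian models, the Kalman filter
# updates the sufficient statistics (mu, P) in closed form, so the
# Gaussian family is preserved. All matrices here are illustrative.
def gaussian_belief_update(mu, P, a, z, A, B, Q, H, R):
    """One step of the parameter-space update theta' = f(theta, a, z)."""
    # prediction through the dynamics x' = A x + B a + noise(Q)
    mu_pred = A @ mu + B @ a
    P_pred = A @ P @ A.T + Q
    # correction by the observation z = H x + noise(R)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (z - H @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ H) @ P_pred
    return mu_new, P_new
```

Each update touches only the low-dimensional parameter vector, never the full continuous distribution over state space.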
Experiments with a continuous navigation problem
Since many of the benchmark POMDP problems from the literature assume a discrete state space, comparison against the state of the art is difficult. This section performs a comparison by modifying the continuous navigation problem to which the Perseus algorithm was applied in [14]. The performance of both algorithms is evaluated using a simulator which implements the continuous version of the world defined in Section 5.1. Sections 5.2 (Discretised solution) and 5.3 (Parametric solution) outline the
Conclusion
The contribution of this work is a novel approach to efficiently applying the POMDP formulation to problems with continuous states, typical in robotics. Choosing a parametric form provides a compact representation of beliefs in a low-dimensional belief-space. Using an estimator with an analytic solution to the belief update function allows efficient belief updates, without requiring either iterations over the entire state space or mappings to and from a different space. An approximate
Acknowledgements
This work is supported by the ARC Centre of Excellence programme, funded by the Australian Research Council (ARC) and the New South Wales State Government.
References (38)
- L.P. Kaelbling, M.L. Littman, A.R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence (1998)
- K.O. Arras et al., Feature-based multi-hypothesis localization and tracking using geometric constraints, Robotics and Autonomous Systems (2003)
- R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (1998)
- A. Brooks, A. Makarenko, S. Williams, H. Durrant-Whyte, Planning in continuous state spaces with parametric POMDPs, in:...
- G. Gordon, Stable function approximation in dynamic programming, in: Proc. Intl. Conference on Machine Learning, 1995,...
- M. Hauskrecht, Value-function approximations for partially observable Markov decision processes, Journal of Artificial Intelligence Research (2000)
- E.J. Sondik, The optimal control of partially observable Markov processes, Ph.D. Thesis, Stanford University,...
- G.E. Monahan, A survey of partially observable Markov decision processes: Theory, models and algorithms, Management Science (1982)
- J.N. Eagle, The optimal search for a moving target when the search path is constrained, Operations Research (1984)
- M.T.J. Spaan, N. Vlassis, Perseus: Randomized point-based value iteration for POMDPs, Journal of Artificial Intelligence Research (2005)
- W.S. Lovejoy, Computationally feasible bounds for partially observed Markov decision processes, Operations Research (1991)
Alex Brooks BA, Adelaide University, 1997, BSc and BE (first class honours), Melbourne University, 2000. He is currently pursuing a Ph.D. in Field Robotics at the University of Sydney. Areas of interest include decision making under uncertainty for robot navigation, vision-based navigation and reusable component-based implementations.
Alexei Makarenko BS in Mechanical Engineering, Rensselaer Polytechnic Institute, Troy, NY, 1996. MSc in Aeronautics and Astronautics, Massachusetts Institute of Technology, Cambridge, MA, 1998. Ph.D. in Field Robotics, University of Sydney, 2005. He is currently pursuing post-doctoral research in decentralised robotic sensor networks at the University of Sydney. Areas of interest include system architectures, human-network interaction, and reusable component-based implementations.
Stefan Williams BASc in Systems Design Engineering, First Class Honours, University of Waterloo, 1997. Ph.D. in Field Robotics, University of Sydney, 2001. His current research focus deals with architectures for autonomous systems and navigation in unstructured environments. He is particularly interested in the area of distributed and decentralised data fusion and how systems can be designed to enable autonomy. He is currently working with multi-vehicle Simultaneous Localisation and Mapping in field environments and is also leading a substantial research effort in the area of navigation and modelling in unstructured, underwater environments.
Hugh Durrant-Whyte received the B.Sc. (Eng.) degree (1st class honors) in mechanical and nuclear engineering from the University of London, England, in 1983, and the M.S.E. and Ph.D. degrees, both in systems engineering, from the University of Pennsylvania, Philadelphia, in 1985 and 1986, respectively. From 1987 to 1995, he was a Senior Lecturer in Engineering Science with the University of Oxford, England, and a Fellow of Oriel College Oxford. Since July 1995 he has been Professor of Mechatronic Engineering with the Department of Mechanical and Mechatronic Engineering, The University of Sydney, Australia, where he leads the ARC Centre of Excellence in Autonomous Systems (CAS). His research work focuses on automation in cargo handling, surface and underground mining, defence, unmanned flight vehicles, and autonomous subsea vehicles.