Potential-based reward shaping for finite horizon online POMDP planning

Autonomous Agents and Multi-Agent Systems 30, 403–445 (2016)

Abstract

In this paper, we address the problem of suboptimal behavior during online partially observable Markov decision process (POMDP) planning caused by time constraints on planning. Taking inspiration from the related field of reinforcement learning (RL), our solution is to shape the agent’s reward function in order to lead the agent to large future rewards without having to spend as much time explicitly estimating cumulative future rewards, enabling the agent to save time to improve the breadth of its planning and build higher quality plans. Specifically, we extend potential-based reward shaping (PBRS) from RL to online POMDP planning. In our extension, information about belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards beyond its planning horizon, and thus achieve greater cumulative rewards. We develop novel potential functions measuring information useful to agent metareasoning in POMDPs (reflecting on agent knowledge and/or histories of experience with the environment), theoretically prove several important properties and benefits of using PBRS for online POMDP planning, and empirically demonstrate these results in a range of classic benchmark POMDP planning problems.
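
To make the approach concrete, the following is a minimal sketch (not the paper's implementation) of potential-based shaping in a depth-limited online POMDP planner: the value of each leaf belief reached at the planning horizon is shaped by a potential function, here the negative belief entropy discussed in the notes below. The POMDP interface assumed here (actions, expected_reward, observation_probs, update) and the dictionary belief representation are illustrative assumptions, not definitions from the paper.

    import math

    def neg_entropy(belief):
        """Potential Phi(b) = -H(b): larger when the belief b (a dict mapping
        states to probabilities) is more certain."""
        return sum(p * math.log(p) for p in belief.values() if p > 0.0)

    def plan(pomdp, belief, depth, gamma=0.95):
        """Finite-horizon lookahead from `belief`; returns (best action, value).

        At the horizon (depth == 0), rather than spending time on explicit
        estimates of cumulative future rewards, the leaf value is shaped by
        the potential of the leaf belief, hinting at rewards reachable beyond
        the planning horizon.
        """
        if depth == 0:
            return None, neg_entropy(belief)          # shaped leaf value
        best_action, best_value = None, -math.inf
        for a in pomdp.actions(belief):
            value = pomdp.expected_reward(belief, a)  # immediate belief-based reward
            for o, prob in pomdp.observation_probs(belief, a):
                next_belief = pomdp.update(belief, a, o)  # Bayesian belief update
                _, future = plan(pomdp, next_belief, depth - 1, gamma)
                value += gamma * prob * future
            if value > best_value:
                best_action, best_value = a, value
        return best_action, best_value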

Notes

  1. Sorg et al. [22] also propose applying their optimal reward framework to MDPs, which is slightly different from PBRS in that it allows path-dependent reward modifications (as opposed to shaping only the values at leaf and initial situations in PBRS; cf. Sect. 3.2). However, they note that in full breadth planning (as considered in this paper), optimal rewards are equivalent to leaf heuristics, and thus also to PBRS. Therefore, for the remainder of the paper, we only refer to leaf evaluation heuristics, but the same discussions apply to optimal rewards as well.

  2. We consider the negative of the entropy since entropy measures uncertainty, the opposite of the certainty we want the potential function to reward.

  3. This example is based on the RockSample benchmark problem described in more detail in Sect. 4.1.2 and used in our experimental study evaluating the empirical performance of PBRS for online POMDP planning.

  4. On the other hand, if we used potential function values to determine how to expand plans, then they would simply represent heuristic functions and the result would be a standard heuristic search algorithm. Since our potential functions are used instead for the evaluation of action values, potential functions are orthogonal to heuristic functions.

  5. To increase the complexity of the RockSample benchmark and make it more suitable for our experimental study (i.e., a little more uncertain, like the other benchmark problems considered in this research), we increased the uncertainty of the observations returned when checking rocks by decreasing the half-efficiency distance of sensing from 20 to 1; a sketch of this sensor model appears after these notes. This is similar to changes made in other experimental studies, including the similar FieldVisionRockSample problem considered in [18, 25].

  6. We use a different range of allotted times \(\tau \) for different problems due to the different sizes of the POMDPs, resulting in different exponential growth of the planning trees calculated by the agents.

  7. A mixed observability MDP (MOMDP) is a special POMDP representation that factors the state space into fully observable variables \(\mathcal{X}\) and partially observable variables \(\mathcal{Y}\), such that \(S=\mathcal{X}\times \mathcal{Y}\), and exploits this factorization to simplify the transition and observation probability calculations to speed up computation. The resulting model is equivalent to (but faster to solve than) the canonical, unfactored POMDP representation of the same problem [15].

  8. Without time constraints, explicit calculations would always be superior because the agent could simply continue planning deeper throughout the entire planning tree. But with time constraints, the agent must of course sacrifice some breadth for depth, causing under- or over-estimations of agent rewards for some belief states, as discussed in Sect. 2.2.

  9. Available online at http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/index.php?n=Main.DownloadDespot .
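
For reference, the following is a small sketch (our reconstruction for illustration, not quoted from the benchmark's original specification) of how the half-efficiency distance mentioned in note 5 typically controls the noise of RockSample's check action: sensing efficiency halves every d0 units of distance, so the probability of a correct reading decays from 1 toward 0.5.

    import random

    def sense_rock(is_good, distance, d0):
        """Return a noisy 'good'/'bad' reading for a rock at the given distance.

        With d0 = 20 the sensor is nearly perfect at typical ranges; with
        d0 = 1 (as used in our experiments) distant checks approach
        uninformative coin flips.
        """
        efficiency = 2.0 ** (-distance / d0)   # efficiency halves every d0 units
        p_correct = 0.5 + 0.5 * efficiency     # 1.0 at distance 0, -> 0.5 far away
        truth = 'good' if is_good else 'bad'
        noise = 'bad' if is_good else 'good'
        return truth if random.random() < p_correct else noise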

References

  1. Araya-Lopez, M., Buffet, O., Thomas, V., & Charpillet, F. (2010). A POMDP extension with belief-dependent rewards. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS’10) (pp. 64–72). Vancouver, B.C., Canada, December 6–9, 2010.

  2. Asmuth, J., Littman, M. L., & Zinkov, R. (2008). Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI’08) (pp. 604–609). Chicago, IL, July 13–17, 2008.

  3. Bertsekas, D. P., & Castanon, D. A. (1999). Rollout algorithms for stochastic scheduling problems. Journal of Heuristics, 5, 89–108.

  4. Boutilier, C. (2002). A POMDP formulation of preference elicitation problems. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI’02) (pp. 239–246). Edmonton, Alberta, Canada, July 28–August 1, 2002.

  5. Boyd, S. P., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

  6. Devlin, S., & Kudenko, D. (2011). Theoretical considerations of potential-based reward shaping for multi-agent systems. In K. Tumer, P. Yolum, L. Sonenberg, & P. Stone (Eds.), Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems (AAMAS’11) (pp. 225–232). Taipei, Taiwan, May 2–6, 2011.

  7. Devlin, S., & Kudenko, D. (2012). Dynamic potential-based reward shaping. In V. Conitzer, M. Winikoff, L. Padgham, & W. van der Hoek (Eds.), Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS’12). Valencia, Spain, June 6–8, 2012.

  8. Doshi, F., & Roy, N. (2008). The permutable POMDP: Fast solutions to POMDPs for preference elicitation. In L. Padgham, D. C. Parkes, J. Muller & S. Parsons (Eds.), Proceedings of the 7th International Conference on Autonomous Agents and Multiagent Systems (AAMAS’08) (pp. 493–500). Estoril, Portugal, May 12–16, 2008.

  9. Eck, A., Soh, L.-K., Devlin, S., & Kudenko, D. (2013). Potential-based reward shaping for POMDPs (Extended Abstract). In T. Ito, C. Jonker, M. Gini, & O. Shehory (Eds.), Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS’13). Saint Paul, Minnesota, May 8–10, 2013.

  10. Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13, 33–94.

  11. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99–134.

  12. Kurniawati, H., Hsu, D., & Lee, W. S. (2008). SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proceedings of the 2008 Robotics: Science and Systems Conference (RSS ’08).

  13. Mihaylova, L. et al. (2002). Active sensing for robotics—A survey. In Proceedings of the 5th International Conference on Numerical Methods and Applications (NM&A’02). Borovets, Bulgaria, August 20–24, 2002.

  14. Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning (ICML’99) (pp. 278–287). Bled, Slovenia, June 27–30, 1999.

  15. Ong, S. C. W., Png, S. W., Hsu, D., & Lee, W. S. (2010). Planning under uncertainty for robotic tasks with mixed observability. International Journal of Robotics Research, 29(8), 1053–1068.

  16. Pineau, J., Gordon, G., & Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI’03) (pp. 1025–1032). Acapulco, Mexico, August 9–15, 2003.

  17. Ross, S., & Chaib-draa, B. (2007). AEMS: An anytime online search algorithm for approximate policy refinement in large POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07) (pp. 2592–2598). Hyderabad, India, January 6–12, 2007.

  18. Ross, S., Pineau, J., Paquet, S., & Chaib-draa, B. (2008). Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32, 663–704.

  19. Silver, D., & Veness, J. (2010). Monte-Carlo planning in large POMDPs. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS’10) (pp. 2164–2172). Vancouver, B.C., Canada, December 6–9, 2010.

  20. Smith, T., & Simmons, R. (2004). Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI’04) (pp. 520–527). Banff, Alberta, Canada, July 7–11, 2004.

  21. Somani, A., Ye, N., Hsu, D., & Lee, W. S. (2013). DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems (NIPS’13).

  22. Sorg, J., Singh, S., & Lewis, R. L. (2011). Optimal rewards versus leaf-evaluation heuristics in planning agents. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI’11) (pp. 465–470). San Francisco, CA, August 7–11, 2011.

  23. Spaan, M. T. J., Veiga, T. S., & Lima, P. U. (2010). Active cooperative perception in networked robotic systems using POMDPs. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’10) (pp. 4800–4805). Taipei, Taiwan, October 18–22, 2010.

  24. Williams, J. D., & Young, S. (2007). Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21, 393–422.

  25. Zhang, Z. & Chen, X. (2012). FHHOP: A factored heuristic online planning algorithm for POMDPs. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence (UAI’12) (pp. 934–943). Catalina Island, USA, August 15–17, 2012.

Acknowledgments

This research was partially supported by a National Science Foundation Graduate Research Fellowship (DGE-054850) and a Grant from the National Science Foundation (SES-1132015).

Author information

Correspondence to Adam Eck.

Cite this article

Eck, A., Soh, L.-K., Devlin, S., & Kudenko, D. (2016). Potential-based reward shaping for finite horizon online POMDP planning. Autonomous Agents and Multi-Agent Systems, 30, 403–445. https://doi.org/10.1007/s10458-015-9292-6
