Predictive representations for policy gradient in POMDPs

Published: 14 June 2009

Abstract

We consider the problem of estimating the policy gradient in Partially Observable Markov Decision Processes (POMDPs) with a special class of policies based on Predictive State Representations (PSRs). We compare PSR policies to Finite-State Controllers (FSCs), the standard policy model for policy gradient methods in POMDPs. We present a general Actor-Critic algorithm for learning both FSCs and PSR policies. The critic computes a value function whose variables are the parameters of the policy; these parameters are gradually updated to maximize the value function. We show that the value function is polynomial for both FSCs and PSR policies, with a potentially smaller degree in the case of PSR policies. The value function of a PSR policy can therefore have fewer local optima than that of the equivalent FSC, and the gradient algorithm is consequently more likely to converge to a globally optimal solution.
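To make the setup concrete, the sketch below combines the two ingredients the abstract describes: a predictive state updated from action-observation pairs, and a parameterised policy that reads that state instead of an FSC's internal state. It is an illustrative Python toy, not the paper's Actor-Critic algorithm; the PSR update matrices, the stand-in environment, and the REINFORCE-style Monte Carlo return used in place of a learned critic are all hypothetical placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_tests, n_actions, n_obs = 3, 2, 2   # sizes of the toy PSR (all hypothetical)

# Hypothetical PSR parameters: one positive update matrix M[a, o] per
# (action, observation) pair, plus a normalised initial predictive state.
M = rng.uniform(0.1, 1.0, size=(n_actions, n_obs, n_tests, n_tests))
b0 = np.full(n_tests, 1.0 / n_tests)

def psr_update(b, a, o):
    """PSR state update: b' = M_ao b, renormalised to sum to one."""
    v = M[a, o] @ b
    return v / v.sum()

def policy(theta, b):
    """Softmax policy over actions, linear in the predictive state b."""
    logits = theta @ b                 # theta has shape (n_actions, n_tests)
    e = np.exp(logits - logits.max())
    return e / e.sum()

theta = np.zeros((n_actions, n_tests))
alpha, gamma, horizon = 0.05, 0.95, 20

for episode in range(500):
    b, grads, rewards = b0.copy(), [], []
    for t in range(horizon):
        pi = policy(theta, b)
        a = rng.choice(n_actions, p=pi)
        o = rng.integers(n_obs)        # toy stand-in for the POMDP's observation
        r = 1.0 if a == 0 else 0.0     # toy reward that simply favours action 0
        g = -np.outer(pi, b)           # score function: d log pi(a|b) / d theta
        g[a] += b
        grads.append(g)
        rewards.append(r)
        b = psr_update(b, a, o)
    G = 0.0                            # Monte Carlo return, REINFORCE update
    for t in reversed(range(horizon)):
        G = rewards[t] + gamma * G
        theta += alpha * G * grads[t]

In the paper's construction the critic instead expresses the value function as a polynomial in the policy parameters, which is what permits the degree comparison between FSC and PSR policies.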

Cited By

  • (2021) Constrained representation learning for recurrent policy optimisation under uncertainty. Adaptive Behavior, 29(3), 253-265. DOI: 10.1177/1059712319891641. Online publication date: 1-Jun-2021.
  • (2020) Learning Transition Models with Time-delayed Causal Relations. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 8087-8093. DOI: 10.1109/IROS45743.2020.9340809. Online publication date: 24-Oct-2020.
  • (2012) Predictively Defined Representations of State. In Reinforcement Learning, 415-439. DOI: 10.1007/978-3-642-27645-3_13. Online publication date: 2012.

Published In

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009, 1331 pages
ISBN: 9781605585161
DOI: 10.1145/1553374

Sponsors

  • NSF
  • Microsoft Research
  • MITACS

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 140 of 548 submissions (26%)
