DOI: 10.1145/1553374.1553441

Near-Bayesian exploration in polynomial time

Published: 14 June 2009

Abstract

We consider the exploration/exploitation problem in reinforcement learning (RL). The Bayesian approach to model-based RL offers an elegant solution to this problem, by considering a distribution over possible models and acting to maximize expected reward; unfortunately, the Bayesian solution is intractable for all but very restricted cases. In this paper we present a simple algorithm, and prove that with high probability it is able to perform ε-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, and extend such results into the setting of Bayesian RL. In this setting, we show that we can achieve lower sample complexity bounds than existing algorithms, while using an exploration strategy that is much greedier than the (extremely cautious) exploration of PAC-MDP algorithms.
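
The sketch below is a minimal illustration of the flavor of approach the abstract describes: maintain a Bayesian posterior over the transition model (here a Dirichlet over a small tabular MDP), plan in the posterior-mean MDP, and stay close to greedy behavior by adding an exploration bonus that decays as state-action counts grow. The state/action sizes, the known-reward assumption, the placeholder environment, and the bonus constant are all illustrative assumptions, not the paper's exact algorithm or constants; see the full version of the paper for the precise construction and guarantees.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's exact algorithm):
# Dirichlet posterior over transitions of a toy tabular MDP, planning in the
# posterior-mean MDP with a count-decaying exploration bonus.

S, A = 5, 2          # toy numbers of states and actions (assumed)
GAMMA = 0.95         # discount factor (assumed)
BETA = 2.0           # exploration-bonus constant (assumed)
R = np.random.rand(S, A)       # rewards, assumed known for simplicity
alpha = np.ones((S, A, S))     # Dirichlet prior counts over next states

def plan(alpha, n_iter=200):
    """Value iteration on the posterior-mean MDP plus a bonus that shrinks
    as the total Dirichlet count for each (s, a) grows; returns a greedy policy."""
    P = alpha / alpha.sum(axis=2, keepdims=True)   # posterior-mean transitions, (S, A, S)
    bonus = BETA / (1.0 + alpha.sum(axis=2))       # decaying exploration bonus, (S, A)
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = R + bonus + GAMMA * P @ V              # action values, (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                        # greedy action per state

def step(s, a):
    """Stand-in environment: samples a next state uniformly (placeholder dynamics)."""
    return np.random.randint(S)

s = 0
for t in range(1000):
    policy = plan(alpha)
    a = policy[s]
    s_next = step(s, a)
    alpha[s, a, s_next] += 1.0                     # Bayesian posterior update
    s = s_next
```

In this style of algorithm the agent acts greedily with respect to the bonus-augmented mean MDP, so exploration is driven entirely by the decaying bonus rather than by the explicit "known/unknown state" machinery of PAC-MDP methods.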


Published In

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009
1331 pages
ISBN: 9781605585161
DOI: 10.1145/1553374

Sponsors

  • NSF
  • Microsoft Research
  • MITACS

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Research-article

Acceptance Rates

Overall acceptance rate: 140 of 548 submissions (26%)


