DOI: 10.1145/1553374.1553441

Near-Bayesian exploration in polynomial time

Published: 14 June 2009

Abstract

We consider the exploration/exploitation problem in reinforcement learning (RL). The Bayesian approach to model-based RL offers an elegant solution to this problem, by considering a distribution over possible models and acting to maximize expected reward; unfortunately, the Bayesian solution is intractable for all but very restricted cases. In this paper we present a simple algorithm, and prove that with high probability it is able to perform ε-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, and extend such results into the setting of Bayesian RL. In this setting, we show that we can achieve lower sample complexity bounds than existing algorithms, while using an exploration strategy that is much greedier than the (extremely cautious) exploration of PAC-MDP algorithms.
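
The sketch below is a minimal illustration of the flavor of approach the abstract describes: maintain a Bayesian posterior over the transition model (here a Dirichlet over a small tabular MDP), plan in the posterior-mean MDP, and stay close to greedy behavior by adding an exploration bonus that decays as state-action counts grow. The state/action sizes, the known-reward assumption, the placeholder environment, and the bonus constant are all illustrative assumptions, not the paper's exact algorithm or constants; see the full version of the paper for the precise construction and guarantees.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's exact algorithm):
# Dirichlet posterior over transitions of a toy tabular MDP, planning in the
# posterior-mean MDP with a count-decaying exploration bonus.

S, A = 5, 2          # toy numbers of states and actions (assumed)
GAMMA = 0.95         # discount factor (assumed)
BETA = 2.0           # exploration-bonus constant (assumed)
R = np.random.rand(S, A)       # rewards, assumed known for simplicity
alpha = np.ones((S, A, S))     # Dirichlet prior counts over next states

def plan(alpha, n_iter=200):
    """Value iteration on the posterior-mean MDP plus a bonus that shrinks
    as the total Dirichlet count for each (s, a) grows; returns a greedy policy."""
    P = alpha / alpha.sum(axis=2, keepdims=True)   # posterior-mean transitions, (S, A, S)
    bonus = BETA / (1.0 + alpha.sum(axis=2))       # decaying exploration bonus, (S, A)
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = R + bonus + GAMMA * P @ V              # action values, (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                        # greedy action per state

def step(s, a):
    """Stand-in environment: samples a next state uniformly (placeholder dynamics)."""
    return np.random.randint(S)

s = 0
for t in range(1000):
    policy = plan(alpha)
    a = policy[s]
    s_next = step(s, a)
    alpha[s, a, s_next] += 1.0                     # Bayesian posterior update
    s = s_next
```

In this style of algorithm the agent acts greedily with respect to the bonus-augmented mean MDP, so exploration is driven entirely by the decaying bonus rather than by the explicit "known/unknown state" machinery of PAC-MDP methods.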


Published In

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009
1331 pages
ISBN: 9781605585161
DOI: 10.1145/1553374

Sponsors

  • NSF
  • Microsoft Research
  • MITACS

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

  • Research-article

Acceptance Rates

Overall acceptance rate: 140 of 548 submissions (26%)


