Reducing reinforcement learning to KWIK online regression

Annals of Mathematics and Artificial Intelligence

Abstract

One of the key problems in reinforcement learning (RL) is balancing exploration and exploitation. Another is learning and acting in large Markov decision processes (MDPs), where compact function approximation must be used. This paper introduces REKWIRE, a provably efficient, model-free algorithm for finite-horizon RL problems with value function approximation (VFA) that addresses the exploration-exploitation tradeoff in a principled way. The crucial element of this algorithm is a reduction of RL to online regression in the recently proposed KWIK (Knows What It Knows) learning model. We show that, if the KWIK online regression problem can be solved efficiently, then the sample complexity of exploration of REKWIRE is polynomial. The reduction therefore suggests a new and sound direction for tackling general RL problems. The efficiency of our algorithm is verified on a set of proof-of-concept experiments where popular, ad hoc exploration approaches fail.
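To make the protocol behind the reduction concrete, here is a minimal sketch of KWIK online regression in Python. The names (MemorizingKWIKRegressor, run_kwik_protocol, UNKNOWN) are our own illustrations, not the paper's API, and the memorizing learner is a deterministic toy chosen only to exhibit the predict-or-admit-ignorance contract that defines KWIK learning.

```python
# Illustrative sketch of the KWIK online-regression protocol; the names are
# hypothetical and the learner is a toy, not the paper's REKWIRE algorithm.

UNKNOWN = None  # the KWIK "I don't know" output, often written as a bottom symbol


class MemorizingKWIKRegressor:
    """Toy KWIK learner for a deterministic target: it predicts only on
    inputs whose labels it has already seen, and otherwise admits
    ignorance, so it never risks an inaccurate prediction."""

    def __init__(self):
        self.table = {}

    def predict(self, x):
        # KWIK contract: any non-UNKNOWN prediction must be accurate
        # (within epsilon); if accuracy cannot be guaranteed, say UNKNOWN.
        return self.table.get(x, UNKNOWN)

    def update(self, x, y):
        # A label is revealed only after the learner outputs UNKNOWN.
        self.table[x] = y


def run_kwik_protocol(learner, inputs, target):
    """Drive the online protocol: inputs arrive one at a time (possibly
    adversarially chosen); the learner 'pays' once per UNKNOWN. A valid
    KWIK learner says UNKNOWN only a bounded number of times."""
    unknown_count = 0
    for x in inputs:
        prediction = learner.predict(x)
        if prediction is UNKNOWN:
            unknown_count += 1            # the cost that bounds exploration
            learner.update(x, target(x))  # observe the label and learn
        # otherwise the prediction is trusted and no feedback is given
    return unknown_count


if __name__ == "__main__":
    learner = MemorizingKWIKRegressor()
    xs = [0, 1, 0, 2, 1, 0]
    cost = run_kwik_protocol(learner, xs, target=lambda x: 2 * x + 1)
    print(cost)  # 3: one UNKNOWN per distinct input
```

The feature of this protocol that the paper exploits is the bound on the number of UNKNOWN outputs: speaking loosely, when the value-function regressor admits ignorance about a state-action pair, the agent can treat that pair optimistically and go collect the missing data, so the amount of exploration is controlled by the KWIK bound. That is the sense in which an efficient KWIK online regressor yields the polynomial sample complexity of exploration claimed above.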

Author information

Corresponding author

Correspondence to Lihong Li.

Additional information

Part of this work was done while L. Li was at Rutgers University.

Cite this article

Li, L., Littman, M.L. Reducing reinforcement learning to KWIK online regression. Ann Math Artif Intell 58, 217–237 (2010). https://doi.org/10.1007/s10472-010-9201-2

