Reducing reinforcement learning to KWIK online regression

Annals of Mathematics and Artificial Intelligence

Abstract

One of the key problems in reinforcement learning (RL) is balancing exploration and exploitation. Another is learning and acting in large Markov decision processes (MDPs), where compact function approximation must be used. This paper introduces REKWIRE, a provably efficient, model-free algorithm for finite-horizon RL problems with value function approximation (VFA) that addresses the exploration-exploitation tradeoff in a principled way. The crucial element of this algorithm is a reduction of RL to online regression in the recently proposed KWIK (Knows What It Knows) learning model. We show that, if the KWIK online regression problem can be solved efficiently, then the sample complexity of exploration of REKWIRE is polynomial. The reduction therefore suggests a new and sound direction for tackling general RL problems. The efficiency of our algorithm is verified on a set of proof-of-concept experiments where popular, ad hoc exploration approaches fail.
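To make the protocol behind the reduction concrete, here is a minimal sketch of KWIK online regression in Python. The names (MemorizingKWIKRegressor, run_kwik_protocol, UNKNOWN) are our own illustrations, not the paper's API, and the memorizing learner is a deterministic toy chosen only to exhibit the predict-or-admit-ignorance contract that defines KWIK learning.

```python
# Illustrative sketch of the KWIK online-regression protocol; the names are
# hypothetical and the learner is a toy, not the paper's REKWIRE algorithm.

UNKNOWN = None  # the KWIK "I don't know" output, often written as a bottom symbol


class MemorizingKWIKRegressor:
    """Toy KWIK learner for a deterministic target: it predicts only on
    inputs whose labels it has already seen, and otherwise admits
    ignorance, so it never risks an inaccurate prediction."""

    def __init__(self):
        self.table = {}

    def predict(self, x):
        # KWIK contract: any non-UNKNOWN prediction must be accurate
        # (within epsilon); if accuracy cannot be guaranteed, say UNKNOWN.
        return self.table.get(x, UNKNOWN)

    def update(self, x, y):
        # A label is revealed only after the learner outputs UNKNOWN.
        self.table[x] = y


def run_kwik_protocol(learner, inputs, target):
    """Drive the online protocol: inputs arrive one at a time (possibly
    adversarially chosen); the learner 'pays' once per UNKNOWN. A valid
    KWIK learner says UNKNOWN only a bounded number of times."""
    unknown_count = 0
    for x in inputs:
        prediction = learner.predict(x)
        if prediction is UNKNOWN:
            unknown_count += 1            # the cost that bounds exploration
            learner.update(x, target(x))  # observe the label and learn
        # otherwise the prediction is trusted and no feedback is given
    return unknown_count


if __name__ == "__main__":
    learner = MemorizingKWIKRegressor()
    xs = [0, 1, 0, 2, 1, 0]
    cost = run_kwik_protocol(learner, xs, target=lambda x: 2 * x + 1)
    print(cost)  # 3: one UNKNOWN per distinct input
```

The feature of this protocol that the paper exploits is the bound on the number of UNKNOWN outputs: speaking loosely, when the value-function regressor admits ignorance about a state-action pair, the agent can treat that pair optimistically and go collect the missing data, so the amount of exploration is controlled by the KWIK bound. That is the sense in which an efficient KWIK online regressor yields the polynomial sample complexity of exploration claimed above.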

Author information

Corresponding author

Correspondence to Lihong Li.

Additional information

Part of this work was done while L. Li was at Rutgers University.

Cite this article

Li, L., Littman, M.L. Reducing reinforcement learning to KWIK online regression. Ann Math Artif Intell 58, 217–237 (2010). https://doi.org/10.1007/s10472-010-9201-2

