
Sample Complexity Bounds of Exploration

  • Chapter in: Reinforcement Learning

Part of the book series: Adaptation, Learning, and Optimization (ALO, volume 12)

Abstract

Efficient exploration is widely recognized as a fundamental challenge inherent in reinforcement learning. Algorithms that explore efficiently converge faster to near-optimal policies. While heuristic techniques are popular in practice, they lack formal guarantees and may not work well in general. This chapter studies algorithms with polynomial sample complexity of exploration, both model-based and model-free, in a unified manner. These so-called PAC-MDP algorithms behave near-optimally except in a “small” number of steps, with high probability. A new learning model known as KWIK is used to unify most existing model-based PAC-MDP algorithms for various subclasses of Markov decision processes. We also compare the sample-complexity framework to alternatives for formalizing exploration efficiency, such as regret minimization and Bayes-optimal solutions.
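To make “behaves near-optimally except in a small number of steps with high probability” concrete, here is a minimal sketch of the PAC-MDP criterion in the form commonly used in this literature; the exact set of parameters inside the polynomial bound is an illustrative assumption, not a quotation from the chapter. For any ε > 0 and δ ∈ (0, 1), with probability at least 1 − δ, the number of timesteps on which the learning algorithm acts more than ε-suboptimally is polynomially bounded:

% Sketch of the PAC-MDP sample-complexity criterion (parameterization assumed, not quoted from the chapter).
% A_t denotes the algorithm's (possibly non-stationary) policy at timestep t, and s_t the state visited at that step.
\left| \bigl\{\, t \;:\; V^{A_t}(s_t) < V^*(s_t) - \epsilon \,\bigr\} \right|
\;\le\; \mathrm{poly}\!\left( |S|,\, |A|,\, \tfrac{1}{\epsilon},\, \tfrac{1}{\delta},\, \tfrac{1}{1-\gamma} \right)
\quad \text{with probability at least } 1 - \delta.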




Author information

Correspondence to Lihong Li.



Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Li, L. (2012). Sample Complexity Bounds of Exploration. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27645-3_6


  • DOI: https://doi.org/10.1007/978-3-642-27645-3_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27644-6

  • Online ISBN: 978-3-642-27645-3

  • eBook Packages: Engineering
