PAC Bounds for Multi-armed Bandit and Markov Decision Processes

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2375)

Abstract

The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that, given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 - δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards rather than the worst-case setting. We also provide a matching lower bound.
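
The naive bound arises from estimating every arm to accuracy ε/2 and taking a union bound over all n arms, which is where the log(n/δ) factor comes from; the improved O((n/ε²) log(1/δ)) bound avoids paying that union bound in full. Below is a minimal Python sketch of the naive baseline only, under an illustrative arm interface (callables returning rewards in [0, 1]) that is an assumption of this sketch, not notation from the paper.

```python
import math
import random

def naive_pac_best_arm(arms, eps, delta):
    """Naive PAC arm selection: pull every arm equally often.

    `arms` is a list of callables returning rewards in [0, 1] (an
    illustrative interface).  Hoeffding's inequality plus a union bound
    over the n arms gives the O((n/eps^2) * log(n/delta)) sample
    complexity quoted as the naive bound: with probability >= 1 - delta,
    every empirical mean is within eps/2 of its true mean, so the
    empirically best arm is eps-optimal.
    """
    n = len(arms)
    # Pulls per arm so that P(|empirical mean - true mean| > eps/2) <= delta/n.
    pulls = math.ceil((2.0 / eps ** 2) * math.log(2.0 * n / delta))
    means = [sum(arm() for _ in range(pulls)) / pulls for arm in arms]
    return max(range(n), key=lambda i: means[i])

# Toy usage: three Bernoulli arms with unknown means.
if __name__ == "__main__":
    arms = [lambda p=p: 1.0 if random.random() < p else 0.0
            for p in (0.40, 0.50, 0.55)]
    print(naive_pac_best_arm(arms, eps=0.1, delta=0.05))
```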

We then show how, given an algorithm for the multi-armed bandit problem under the PAC model, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating Value Iteration and, in each iteration, invoking the multi-armed bandit algorithm. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
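
To make the reduction concrete, here is a minimal sketch assuming a generative-model interface sample(s, a) that returns (next_state, reward) with rewards in [0, 1]; the interface, the rescaling, and the per-state sample sizes are illustrative assumptions, not the paper's construction. The idea it illustrates is the one stated above: each Value Iteration backup maximizes E[r + γ·V(s')] over actions, which is a multi-armed bandit problem over the actions, so a PAC bandit algorithm (such as the naive sketch above, or the paper's improved one) can be plugged in to choose a near-optimal action.

```python
def bandit_value_iteration(states, actions, sample, gamma,
                           eps, delta, iterations, pac_best_arm):
    """Batch MDP learning by simulating Value Iteration with a PAC bandit.

    `sample(s, a)` is an assumed generative model returning
    (next_state, reward) with rewards in [0, 1].  `pac_best_arm(arms,
    eps, delta)` is any PAC arm-selection routine over callables that
    return rewards in [0, 1], e.g. the naive sketch above.
    """
    V = {s: 0.0 for s in states}           # value estimates
    policy = {s: actions[0] for s in states}
    scale = 1.0 - gamma                    # backups lie in [0, 1/(1 - gamma)]
    for _ in range(iterations):
        next_V = {}
        for s in states:
            # One arm per action: a pull draws a sampled backup r + gamma*V(s'),
            # rescaled into [0, 1] for the bandit routine.
            def make_arm(a, s=s):
                def pull():
                    s_next, r = sample(s, a)
                    return scale * (r + gamma * V[s_next])
                return pull
            arms = [make_arm(a) for a in actions]
            best = pac_best_arm(arms, scale * eps, delta / len(states))
            policy[s] = actions[best]
            # Re-estimate the chosen backup to update the value of s
            # (200 draws is an arbitrary illustrative sample size).
            draws = [sample(s, actions[best]) for _ in range(200)]
            next_V[s] = sum(r + gamma * V[s2] for s2, r in draws) / len(draws)
        V = next_V
    return V, policy
```

In such a reduction, the number of actions enters only through the bandit routine's sample complexity, which is why a better PAC bandit algorithm improves the dependence on the number of actions.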

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Even-Dar, E., Mannor, S., Mansour, Y. (2002). PAC Bounds for Multi-armed Bandit and Markov Decision Processes. In: Kivinen, J., Sloan, R.H. (eds) Computational Learning Theory. COLT 2002. Lecture Notes in Computer Science, vol 2375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45435-7_18

  • DOI: https://doi.org/10.1007/3-540-45435-7_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43836-6

  • Online ISBN: 978-3-540-45435-9
