Abstract
The multi-armed bandit problem is revisited and considered under the PAC model. Our main contribution is to show that, given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times in total to find an ε-optimal arm with probability at least 1 - δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose sample complexity depends on the specific setting of the rewards, rather than the worst-case setting. We also provide a matching lower bound.
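One way a bound of this shape can be achieved is by repeatedly sampling all surviving arms and eliminating the empirically worse half, while tightening per-round accuracy and confidence so the totals sum to (ε, δ). The sketch below is illustrative, not the paper's exact algorithm: the function name, the (3/4)-and-halving schedule of the per-round parameters, and the interface of arms as callables returning rewards in [0, 1] are all assumptions made here for concreteness.

```python
import math


def median_elimination(arms, eps, delta):
    """Sketch of a PAC (eps, delta) best-arm identification routine.

    `arms` is a list of callables, each returning a reward in [0, 1].
    Returns the index of an arm whose mean is (intended to be) within
    eps of the best, with probability at least 1 - delta.
    """
    surviving = list(range(len(arms)))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(surviving) > 1:
        # Sample every surviving arm enough times for this round's slack.
        n_samples = math.ceil(4.0 / eps_l**2 * math.log(3.0 / delta_l))
        means = {
            i: sum(arms[i]() for _ in range(n_samples)) / n_samples
            for i in surviving
        }
        # Keep only the better half of the arms by empirical mean.
        surviving.sort(key=lambda i: means[i], reverse=True)
        surviving = surviving[: (len(surviving) + 1) // 2]
        # Tighten per-round accuracy and confidence; the per-round
        # budgets form geometric series summing to (eps, delta).
        eps_l *= 3.0 / 4.0
        delta_l /= 2.0
    return surviving[0]
```

Because only half the arms survive each round while the per-arm sample count grows geometrically, the total number of pulls stays O((n/ε²) log(1/δ)) rather than paying a log n factor for a union bound over all arms at full precision.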
We show how, given an algorithm for the PAC multi-armed bandit problem, one can derive a batch learning algorithm for Markov Decision Processes. This is done essentially by simulating value iteration, invoking the multi-armed bandit algorithm at each iteration. Using our PAC algorithm for the multi-armed bandit problem, we improve the dependence on the number of actions.
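The reduction can be pictured as follows: in each phase of value iteration, the max over actions at a state is a bandit problem whose arms are sampled Bellman backups r + γV(s′) drawn from a generative model. This is only a sketch under assumptions made here: the names `sample_model`, `naive_bandit`, and `phased_value_iteration` and all parameter choices are illustrative, and the trivial uniform-sampling bandit stands in for the PAC bandit algorithm, which is exactly the component whose replacement improves the dependence on the number of actions.

```python
def naive_bandit(arms, n_samples=100):
    """Pull each arm n_samples times; return the empirically best index.

    A smarter PAC bandit algorithm can be dropped in here to reduce the
    total number of samples spent per state."""
    means = [sum(arm() for _ in range(n_samples)) / n_samples for arm in arms]
    return max(range(len(arms)), key=means.__getitem__)


def phased_value_iteration(sample_model, n_states, n_actions, gamma, horizon,
                           bandit=naive_bandit, n_eval=100):
    """Simulated value iteration with the action max delegated to a bandit.

    `sample_model(s, a)` returns one sampled (reward, next_state) pair."""
    V = [0.0] * n_states
    for _ in range(horizon):
        V_new = [0.0] * n_states
        for s in range(n_states):
            # Each action is an "arm" whose pull is a sampled Bellman backup.
            def make_arm(a, s=s):
                def pull():
                    r, s_next = sample_model(s, a)
                    return r + gamma * V[s_next]
                return pull
            arms = [make_arm(a) for a in range(n_actions)]
            a_star = bandit(arms)                     # near-optimal action
            # Estimate the backed-up value of the chosen action.
            V_new[s] = sum(arms[a_star]() for _ in range(n_eval)) / n_eval
        V = V_new
    return V
```

On a deterministic two-state chain where action 0 in state 0 yields reward 1 and stays put, this recovers V(0) → 1/(1 − γ) as the number of phases grows, as value iteration would.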
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Even-Dar, E., Mannor, S., Mansour, Y. (2002). PAC Bounds for Multi-armed Bandit and Markov Decision Processes. In: Kivinen, J., Sloan, R.H. (eds) Computational Learning Theory. COLT 2002. Lecture Notes in Computer Science(), vol 2375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45435-7_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43836-6
Online ISBN: 978-3-540-45435-9