Abstract
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that perform an online exploration of the arms. The strategies are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. The main result is that the required exploration–exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.
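To make the performance criterion concrete, here is a minimal, illustrative Python sketch (not taken from the paper): a few hypothetical Bernoulli arms are explored with a simple round-robin allocation for n rounds, the empirically best arm is recommended at the end, and the simple regret is the gap between the best mean and the mean of the recommended arm. The arm means, the seeds, and the uniform allocation are assumptions made only for this example; the paper studies general exploration strategies.

```python
import random

def simple_regret_uniform(means, n, rng=random.Random(0)):
    """Pure-exploration sketch: pull arms round-robin for n rounds,
    then recommend the arm with the highest empirical mean.
    Returns the simple regret mu* - mu_J, where J is the recommendation.
    (Illustrative only; the uniform allocation is one possible forecaster.)"""
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(n):
        i = t % K                                          # uniform (round-robin) allocation
        reward = 1.0 if rng.random() < means[i] else 0.0   # Bernoulli reward from arm i
        counts[i] += 1
        sums[i] += reward
    emp = [sums[i] / counts[i] if counts[i] else 0.0 for i in range(K)]
    recommended = max(range(K), key=lambda i: emp[i])
    return max(means) - means[recommended]

if __name__ == "__main__":
    means = [0.5, 0.45, 0.6]   # hypothetical arm means, for illustration only
    for n in (30, 300, 3000):
        avg = sum(simple_regret_uniform(means, n, random.Random(s))
                  for s in range(200)) / 200
        print(f"n={n:5d}  average simple regret = {avg:.3f}")
```

Running the sketch, the average simple regret shrinks as the number of exploration rounds n grows, which is the sense in which exploration here is constrained only by the number of available rounds rather than by rewards collected along the way.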
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
Cite this paper
Bubeck, S., Munos, R., Stoltz, G. (2009). Pure Exploration in Multi-armed Bandits Problems. In: Gavaldà, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds) Algorithmic Learning Theory. ALT 2009. Lecture Notes in Computer Science, vol. 5809. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04414-4_7
DOI: https://doi.org/10.1007/978-3-642-04414-4_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04413-7
Online ISBN: 978-3-642-04414-4