Abstract
The K-armed bandit problem is a formalization of the exploration-versus-exploitation dilemma, a well-known issue in stochastic optimization tasks. In a K-armed bandit problem, a player faces a gambling machine with K arms, where each arm is associated with an unknown gain distribution, and the goal is to maximize the sum of the rewards (or, equivalently, to minimize the sum of the losses). Several approaches have been proposed in the literature to deal with the K-armed bandit problem. Most of them combine a greedy exploitation strategy with a random exploratory phase. This paper focuses on improving the exploration step by means of the notion of probability of correct selection (PCS), a well-known concept in the simulation literature yet overlooked in the optimization domain. The rationale of our approach is to perform, at each exploration step, the arm sampling that maximizes the probability of selecting the optimal arm (i.e. the PCS) at the following step. This strategy is implemented by a bandit algorithm, called ε-PCSgreedy, which integrates the PCS exploration approach with the classical ε-greedy scheme. A set of numerical experiments on artificial and real datasets shows that a more effective exploration may improve the performance of the entire bandit strategy.
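The idea sketched in the abstract can be illustrated in code. The snippet below is a minimal sketch, not the paper's implementation: it keeps the ε-greedy skeleton but, during exploration, instead of sampling uniformly it picks the non-greedy arm most likely to overtake the current empirical leader under a normal approximation. This is only a simplified proxy for the PCS-maximizing criterion (the exact computation in the paper involves multivariate normal probabilities); all class and method names (`EpsPCSGreedy`, `select`, `update`) are hypothetical.

```python
import math
import random


class EpsPCSGreedy:
    """Sketch of an eps-greedy bandit whose exploration step targets the
    arm most likely to displace the current empirical leader -- a
    simplified stand-in for the PCS-based exploration criterion."""

    def __init__(self, k, eps=0.1):
        self.k = k
        self.eps = eps
        self.counts = [0] * k
        self.means = [0.0] * k
        self.m2 = [0.0] * k  # running sum of squared deviations (Welford)

    def _std_err(self, i):
        """Standard error of the empirical mean of arm i."""
        n = self.counts[i]
        if n < 2:
            return float("inf")  # under-sampled arm: maximal uncertainty
        return math.sqrt(self.m2[i] / (n - 1) / n)

    def select(self):
        best = max(range(self.k), key=lambda i: self.means[i])
        if random.random() > self.eps:
            return best  # exploitation: play the greedy arm
        # Exploration: among non-greedy arms, pick the one with the largest
        # (approximate) probability of actually beating the leader -- a
        # crude proxy for maximizing the next-step PCS.
        def score(i):
            se = math.hypot(self._std_err(best), self._std_err(i))
            if not math.isfinite(se):
                return math.inf   # unexplored arms come first
            if se == 0.0:
                return -math.inf  # no residual uncertainty to resolve
            return (self.means[i] - self.means[best]) / se
        candidates = [i for i in range(self.k) if i != best]
        return max(candidates, key=score)

    def update(self, arm, reward):
        """Incorporate an observed reward (Welford's online update)."""
        self.counts[arm] += 1
        n = self.counts[arm]
        delta = reward - self.means[arm]
        self.means[arm] += delta / n
        self.m2[arm] += delta * (reward - self.means[arm])
```

With `eps = 0` the policy is purely greedy; with `eps = 1` every pull is an exploration step, and the score directs it toward the most promising challenger rather than a uniformly random arm.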
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Caelen, O., Bontempi, G. (2008). Improving the Exploration Strategy in Bandit Algorithms. In: Maniezzo, V., Battiti, R., Watson, JP. (eds) Learning and Intelligent Optimization. LION 2007. Lecture Notes in Computer Science, vol 5313. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92695-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-92694-8
Online ISBN: 978-3-540-92695-5