Abstract
The K-armed bandit problem is a formalization of the exploration-versus-exploitation dilemma, a well-known issue in stochastic optimization tasks. In a K-armed bandit problem, a player faces a gambling machine with K arms, where each arm is associated with an unknown gain distribution, and the goal is to maximize the sum of the rewards (or, equivalently, to minimize the sum of the losses). Several approaches have been proposed in the literature to deal with the K-armed bandit problem. Most of them combine a greedy exploitation strategy with a random exploratory phase. This paper focuses on improving the exploration step by means of the notion of probability of correct selection (PCS), a well-known concept in the simulation literature yet overlooked in the optimization domain. The rationale of our approach is to perform, at each exploration step, the arm sampling that maximizes the probability of selecting the optimal arm (i.e. the PCS) at the following step. This strategy is implemented by a bandit algorithm, called ε-PCSgreedy, which integrates the PCS exploration approach with the classical ε-greedy scheme. A set of numerical experiments on artificial and real datasets shows that a more effective exploration may improve the performance of the entire bandit strategy.
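The idea sketched in the abstract can be illustrated in code. The snippet below is a minimal sketch, not the paper's implementation: it keeps the ε-greedy skeleton but, during exploration, instead of sampling uniformly it picks the non-greedy arm most likely to overtake the current empirical leader under a normal approximation. This is only a simplified proxy for the PCS-maximizing criterion (the exact computation in the paper involves multivariate normal probabilities); all class and method names (`EpsPCSGreedy`, `select`, `update`) are hypothetical.

```python
import math
import random


class EpsPCSGreedy:
    """Sketch of an eps-greedy bandit whose exploration step targets the
    arm most likely to displace the current empirical leader -- a
    simplified stand-in for the PCS-based exploration criterion."""

    def __init__(self, k, eps=0.1):
        self.k = k
        self.eps = eps
        self.counts = [0] * k
        self.means = [0.0] * k
        self.m2 = [0.0] * k  # running sum of squared deviations (Welford)

    def _std_err(self, i):
        """Standard error of the empirical mean of arm i."""
        n = self.counts[i]
        if n < 2:
            return float("inf")  # under-sampled arm: maximal uncertainty
        return math.sqrt(self.m2[i] / (n - 1) / n)

    def select(self):
        best = max(range(self.k), key=lambda i: self.means[i])
        if random.random() > self.eps:
            return best  # exploitation: play the greedy arm
        # Exploration: among non-greedy arms, pick the one with the largest
        # (approximate) probability of actually beating the leader -- a
        # crude proxy for maximizing the next-step PCS.
        def score(i):
            se = math.hypot(self._std_err(best), self._std_err(i))
            if not math.isfinite(se):
                return math.inf   # unexplored arms come first
            if se == 0.0:
                return -math.inf  # no residual uncertainty to resolve
            return (self.means[i] - self.means[best]) / se
        candidates = [i for i in range(self.k) if i != best]
        return max(candidates, key=score)

    def update(self, arm, reward):
        """Incorporate an observed reward (Welford's online update)."""
        self.counts[arm] += 1
        n = self.counts[arm]
        delta = reward - self.means[arm]
        self.means[arm] += delta / n
        self.m2[arm] += delta * (reward - self.means[arm])
```

With `eps = 0` the policy is purely greedy; with `eps = 1` every pull is an exploration step, and the score directs it toward the most promising challenger rather than a uniformly random arm.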
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Caelen, O., Bontempi, G. (2008). Improving the Exploration Strategy in Bandit Algorithms. In: Maniezzo, V., Battiti, R., Watson, JP. (eds) Learning and Intelligent Optimization. LION 2007. Lecture Notes in Computer Science, vol 5313. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92695-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-92694-8
Online ISBN: 978-3-540-92695-5