Abstract
We focus on the effect of exploration/exploitation trade-off strategies on the algorithmic design of multi-armed bandits (MAB) with reward vectors. The Pareto dominance relation assesses the quality of reward vectors in infinite-horizon MABs, such as the UCB1 and UCB2 algorithms. In single-objective MABs, there is a trade-off between exploration of the suboptimal arms and exploitation of a single optimal arm. Pareto-dominance-based MABs, by contrast, fairly exploit all Pareto-optimal arms while exploring the suboptimal ones. We study the exploration vs. exploitation trade-off for two UCB-like algorithms for reward vectors. We analyse the properties of the proposed MAB algorithms in terms of upper regret bounds, and we experimentally compare their exploration vs. exploitation trade-off on a bi-objective Bernoulli environment from control theory.
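As a rough illustration of the idea behind Pareto-dominance-based bandits (not the paper's exact algorithm), the sketch below builds per-objective UCB1-style indices for each arm and then pulls an arm chosen uniformly at random from the Pareto front of those index vectors, so all Pareto-optimal arms are exploited fairly. The toy bi-objective Bernoulli environment and all function names are hypothetical.

```python
import math
import random

def dominates(u, v):
    """True if vector u Pareto-dominates v: at least as good in every
    objective and strictly better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(vectors):
    """Indices of the non-dominated vectors."""
    return [i for i, u in enumerate(vectors)
            if not any(dominates(v, u) for j, v in enumerate(vectors) if j != i)]

def _pull(i, arm_means, counts, sums, rng):
    """Sample a bi-objective Bernoulli reward vector from arm i."""
    counts[i] += 1
    for j, p in enumerate(arm_means[i]):
        sums[i][j] += 1.0 if rng.random() < p else 0.0

def pareto_ucb1(arm_means, horizon, rng):
    """Pareto-UCB1-style loop: per-objective UCB indices, then a uniformly
    random pull among the arms whose index vectors are Pareto optimal."""
    k, d = len(arm_means), len(arm_means[0])
    counts = [0] * k
    sums = [[0.0] * d for _ in range(k)]
    for i in range(k):                      # initialise: pull each arm once
        _pull(i, arm_means, counts, sums, rng)
    for n in range(k, horizon):
        ucb = [tuple(sums[i][j] / counts[i]
                     + math.sqrt(2.0 * math.log(n) / counts[i])
                     for j in range(d))
               for i in range(k)]
        i = rng.choice(pareto_front(ucb))   # fair choice across the front
        _pull(i, arm_means, counts, sums, rng)
    return counts

rng = random.Random(0)
# Bi-objective Bernoulli arms: arms 0 and 1 are Pareto optimal, arm 2 is dominated.
arms = [(0.9, 0.2), (0.2, 0.9), (0.1, 0.1)]
pulls = pareto_ucb1(arms, 2000, rng)
```

Choosing uniformly among the Pareto-optimal index vectors is what distinguishes this family from single-objective UCB1, which always exploits the single arm with the highest scalar index.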
Acknowledgements
Madalina M. Drugan was supported by the IWT-SBO project PERPETUAL (gr. nr. 110041).
Copyright information
© 2015 Springer International Publishing Switzerland
Cite this paper
Drugan, M.M. (2015). Infinite Horizon Multi-armed Bandits with Reward Vectors: Exploration/Exploitation Trade-off. In: Duval, B., van den Herik, J., Loiseau, S., Filipe, J. (eds) Agents and Artificial Intelligence. ICAART 2015. Lecture Notes in Computer Science(), vol 9494. Springer, Cham. https://doi.org/10.1007/978-3-319-27947-3_7
Print ISBN: 978-3-319-27946-6
Online ISBN: 978-3-319-27947-3