Abstract
The fundamental problem in learning and planning for Markov Decision Processes (MDPs) is how the agent explores and exploits an uncertain environment. The classical solutions to this problem are essentially heuristics that lack appropriate theoretical justification. As a result, principled solutions based on Bayesian estimation, though intractable even in small cases, have recently been investigated. The common approach is to approximate Bayesian estimation with sophisticated methods that cope with the intractability of computing the Bayesian posterior. However, we observe that the complexity of these approximations still prevents their use, as the improvement in long-term reward seems to be outweighed by the difficulties of implementation. In this work, we propose a deliberately simplistic model-based algorithm to show the benefits of Bayesian estimation when compared with classical model-free solutions. In particular, our agent combines several Markov chains from its belief state and uses the matrix-based Elimination Algorithm to find the best action to take. We test our agent on the three standard problems Chain, Loop, and Maze, and find that it outperforms classical Q-learning with the ε-greedy, Boltzmann, and Interval Estimation action-selection heuristics.
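As a rough, illustrative sketch of the model-based Bayesian approach that the abstract contrasts with model-free Q-learning, the toy agent below keeps Dirichlet pseudo-counts over transitions, plans on the posterior-mean model with value iteration, and acts greedily. This is not the paper's algorithm (which combines several Markov chains drawn from the belief state and applies the matrix-based Elimination Algorithm); the class name, prior strength, and discount factor are assumptions made for the example.

```python
import numpy as np

# Minimal sketch of model-based Bayesian exploration in a small finite MDP.
# NOT the paper's method: here we only maintain Dirichlet pseudo-counts over
# transitions, plan on the posterior-mean model with value iteration, and act
# greedily. All names and hyperparameters are illustrative assumptions.

class BayesianModelBasedAgent:
    def __init__(self, n_states, n_actions, gamma=0.95, prior=1.0):
        self.nS, self.nA, self.gamma = n_states, n_actions, gamma
        # Dirichlet pseudo-counts for P(s'|s,a) and running reward statistics.
        self.counts = np.full((n_states, n_actions, n_states), prior)
        self.reward_sum = np.zeros((n_states, n_actions))
        self.reward_n = np.zeros((n_states, n_actions))

    def update(self, s, a, r, s_next):
        # Bayesian update of the transition belief; empirical mean for rewards.
        self.counts[s, a, s_next] += 1.0
        self.reward_sum[s, a] += r
        self.reward_n[s, a] += 1.0

    def plan(self, n_iters=200):
        # Posterior-mean transition model and mean rewards (0 if unvisited).
        P = self.counts / self.counts.sum(axis=2, keepdims=True)
        R = np.where(self.reward_n > 0,
                     self.reward_sum / np.maximum(self.reward_n, 1.0), 0.0)
        V = np.zeros(self.nS)
        for _ in range(n_iters):            # value iteration on the mean model
            Q = R + self.gamma * (P @ V)    # shape (nS, nA)
            V = Q.max(axis=1)
        return Q

    def act(self, s):
        return int(np.argmax(self.plan()[s]))


# Hypothetical usage on a 5-state, 2-action problem.
agent = BayesianModelBasedAgent(n_states=5, n_actions=2)
agent.update(s=0, a=1, r=1.0, s_next=2)
print(agent.act(0))
```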
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
Cite this paper
Novoa, E. (2007). Simple Model-Based Exploration and Exploitation of Markov Decision Processes Using the Elimination Algorithm. In: Gelbukh, A., Kuri Morales, Á.F. (eds) MICAI 2007: Advances in Artificial Intelligence. Lecture Notes in Computer Science, vol 4827. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76631-5_31
DOI: https://doi.org/10.1007/978-3-540-76631-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76630-8
Online ISBN: 978-3-540-76631-5