Abstract
This paper presents a framework for tuning continual exploration in an optimal way. It first quantifies the rate of exploration by defining the degree of exploration of a state as the entropy of the probability distribution for choosing an admissible action. The exploration/exploitation tradeoff is then stated as a global optimization problem: find the exploration strategy that minimizes the expected cumulated cost while maintaining fixed degrees of exploration at the nodes. In other words, “exploitation” is maximized for constant “exploration”. This formulation leads to a set of nonlinear updating rules reminiscent of the value-iteration algorithm. Convergence of these rules to a local minimum can be proved for a stationary environment. Interestingly, in the deterministic case, when there is no exploration, these equations reduce to the Bellman equations for finding the shortest path, while, when exploration is maximal, a full “blind” exploration is performed.
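A minimal sketch of the idea described above, not the authors' exact algorithm: value-iteration-like updates in which each state's action-choice distribution is a Boltzmann (softmax) policy. The graph `costs`, the function names, and the inverse temperature `theta` are illustrative assumptions; in the paper the per-state temperature would be tuned so that the policy entropy matches the prescribed degree of exploration, whereas here `theta` is set directly. A large `theta` recovers Bellman shortest-path values (pure exploitation); `theta = 0` gives uniform, “blind” exploration.

```python
import numpy as np

# Hypothetical deterministic graph: costs[s] = list of (cost, successor) pairs.
costs = {
    0: [(1.0, 1), (4.0, 2)],
    1: [(2.0, 2), (5.0, 3)],
    2: [(1.0, 3)],
    3: [],                      # goal state, zero cost-to-go
}

def boltzmann_policy(q_values, theta):
    """Action probabilities p(a) proportional to exp(-theta * q(a));
    theta -> infinity concentrates on the minimum-cost action,
    theta = 0 yields the uniform distribution (maximal entropy)."""
    z = -theta * np.asarray(q_values, dtype=float)
    z -= z.max()                        # numerical stability
    p = np.exp(z)
    return p / p.sum()

def expected_cost_to_go(theta=2.0, n_iter=200):
    """Fixed-point iteration on the expected cumulated cost under the
    entropy-controlled policy (a stand-in for the paper's updating rules)."""
    V = {s: 0.0 for s in costs}
    for _ in range(n_iter):
        for s, actions in costs.items():
            if not actions:             # absorbing goal state
                continue
            q = [c + V[s_next] for c, s_next in actions]
            p = boltzmann_policy(q, theta)
            V[s] = float(np.dot(p, q))  # expected cost under the current policy
    return V

print(expected_cost_to_go(theta=50.0))  # close to Bellman shortest-path costs
print(expected_cost_to_go(theta=0.0))   # plain averages: full blind exploration
```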
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Achbany, Y., Fouss, F., Yen, L., Pirotte, A., Saerens, M. (2006). Optimal Tuning of Continual Online Exploration in Reinforcement Learning. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds) Artificial Neural Networks – ICANN 2006. ICANN 2006. Lecture Notes in Computer Science, vol 4131. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11840817_82
DOI: https://doi.org/10.1007/11840817_82
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-38625-4
Online ISBN: 978-3-540-38627-8
eBook Packages: Computer Science, Computer Science (R0)