Abstract
We consider batch reinforcement learning in continuous-space, expected total discounted-reward Markovian Decision Problems (MDPs). In contrast to previous theoretical work, we consider the case when the training data consist of a single sample path (trajectory) of some behaviour policy; in particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration in which, in successive iterations, the Q-functions of the intermediate policies are obtained by minimizing a novel Bellman-residual type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance, where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying MDP, and the approximation power and capacity of the function set used.
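To make the algorithmic setting concrete, below is a minimal, illustrative Python sketch of a Bellman-residual minimization based fitted policy iteration of the kind the abstract describes. The toy trajectory, the polynomial feature map, and all function names are assumptions introduced here for illustration only; in particular, the sketch minimizes the naive squared Bellman residual, whereas the paper's analysis rests on a novel, modified Bellman-residual type criterion and is not reproduced by this code.

```python
import numpy as np

# Illustrative sketch only: fitted policy iteration where each policy's
# Q-function is obtained by minimizing an empirical Bellman-residual
# type error along a single sample path.  Feature map, data, and all
# names are hypothetical choices, not the paper's construction.

def features(x, a, n_actions, degree=3):
    """Polynomial features of the continuous state, one block per action."""
    phi = np.zeros((degree + 1) * n_actions)
    phi[a * (degree + 1):(a + 1) * (degree + 1)] = [x ** d for d in range(degree + 1)]
    return phi

def greedy_action(x, theta, n_actions):
    """Action maximizing the fitted linear Q-function at state x."""
    return int(np.argmax([features(x, a, n_actions) @ theta for a in range(n_actions)]))

def fit_q(path, policy, gamma, n_actions, ridge=1e-3):
    """Fit Q^pi by least-squares minimization of the naive squared
    Bellman residual r + gamma*Q(x', pi(x')) - Q(x, a) over the path."""
    Phi, Phi_next, R = [], [], []
    for (x, a, r, x_next) in path:
        Phi.append(features(x, a, n_actions))
        Phi_next.append(features(x_next, policy(x_next), n_actions))
        R.append(r)
    Phi, Phi_next, R = np.array(Phi), np.array(Phi_next), np.array(R)
    D = Phi - gamma * Phi_next                    # residual design matrix
    A = D.T @ D + ridge * np.eye(D.shape[1])      # regularized normal equations
    return np.linalg.solve(A, D.T @ R)

def fitted_policy_iteration(path, gamma, n_actions, n_iters=10):
    """Alternate Bellman-residual fitting and greedy policy improvement."""
    theta = np.zeros(features(0.0, 0, n_actions).shape[0])
    for _ in range(n_iters):
        policy = lambda x, th=theta: greedy_action(x, th, n_actions)
        theta = fit_q(path, policy, gamma, n_actions)
    return theta

# Hypothetical single trajectory of (state, action, reward, next_state) tuples
# generated by a random behaviour policy on a toy 1-D environment.
rng = np.random.default_rng(0)
path, x = [], 0.0
for _ in range(500):
    a = int(rng.integers(2))
    x_next = float(np.clip(x + (0.1 if a == 1 else -0.1) + 0.05 * rng.normal(), -1, 1))
    path.append((x, a, -abs(x_next), x_next))     # reward: stay near the origin
    x = x_next

theta = fitted_policy_iteration(path, gamma=0.95, n_actions=2)
print("greedy action at x=0.5:", greedy_action(0.5, theta, 2))
```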
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Antos, A., Szepesvári, C., Munos, R. (2006). Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path. In: Lugosi, G., Simon, H.U. (eds) Learning Theory. COLT 2006. Lecture Notes in Computer Science(), vol 4005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11776420_42
DOI: https://doi.org/10.1007/11776420_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35294-5
Online ISBN: 978-3-540-35296-9