Abstract
We consider batch reinforcement learning in continuous-space, expected total discounted-reward Markovian Decision Problems (MDPs). In contrast to previous theoretical work, we consider the case when the training data consist of a single sample path (trajectory) of some behaviour policy; in particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration in which, in successive iterations, the Q-functions of the intermediate policies are obtained by minimizing a novel Bellman-residual type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance, where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying MDP, and the approximation power and capacity of the function set used.
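To make the algorithmic setting concrete, below is a minimal, illustrative Python sketch of a Bellman-residual minimization based fitted policy iteration of the kind the abstract describes. The toy trajectory, the polynomial feature map, and all function names are assumptions introduced here for illustration only; in particular, the sketch minimizes the naive squared Bellman residual, whereas the paper's analysis rests on a novel, modified Bellman-residual type criterion and is not reproduced by this code.

```python
import numpy as np

# Illustrative sketch only: fitted policy iteration where each policy's
# Q-function is obtained by minimizing an empirical Bellman-residual
# type error along a single sample path.  Feature map, data, and all
# names are hypothetical choices, not the paper's construction.

def features(x, a, n_actions, degree=3):
    """Polynomial features of the continuous state, one block per action."""
    phi = np.zeros((degree + 1) * n_actions)
    phi[a * (degree + 1):(a + 1) * (degree + 1)] = [x ** d for d in range(degree + 1)]
    return phi

def greedy_action(x, theta, n_actions):
    """Action maximizing the fitted linear Q-function at state x."""
    return int(np.argmax([features(x, a, n_actions) @ theta for a in range(n_actions)]))

def fit_q(path, policy, gamma, n_actions, ridge=1e-3):
    """Fit Q^pi by least-squares minimization of the naive squared
    Bellman residual r + gamma*Q(x', pi(x')) - Q(x, a) over the path."""
    Phi, Phi_next, R = [], [], []
    for (x, a, r, x_next) in path:
        Phi.append(features(x, a, n_actions))
        Phi_next.append(features(x_next, policy(x_next), n_actions))
        R.append(r)
    Phi, Phi_next, R = np.array(Phi), np.array(Phi_next), np.array(R)
    D = Phi - gamma * Phi_next                    # residual design matrix
    A = D.T @ D + ridge * np.eye(D.shape[1])      # regularized normal equations
    return np.linalg.solve(A, D.T @ R)

def fitted_policy_iteration(path, gamma, n_actions, n_iters=10):
    """Alternate Bellman-residual fitting and greedy policy improvement."""
    theta = np.zeros(features(0.0, 0, n_actions).shape[0])
    for _ in range(n_iters):
        policy = lambda x, th=theta: greedy_action(x, th, n_actions)
        theta = fit_q(path, policy, gamma, n_actions)
    return theta

# Hypothetical single trajectory of (state, action, reward, next_state) tuples
# generated by a random behaviour policy on a toy 1-D environment.
rng = np.random.default_rng(0)
path, x = [], 0.0
for _ in range(500):
    a = int(rng.integers(2))
    x_next = float(np.clip(x + (0.1 if a == 1 else -0.1) + 0.05 * rng.normal(), -1, 1))
    path.append((x, a, -abs(x_next), x_next))     # reward: stay near the origin
    x = x_next

theta = fitted_policy_iteration(path, gamma=0.95, n_actions=2)
print("greedy action at x=0.5:", greedy_action(0.5, theta, 2))
```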
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Antos, A., Szepesvári, C., Munos, R. (2006). Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path. In: Lugosi, G., Simon, H.U. (eds) Learning Theory. COLT 2006. Lecture Notes in Computer Science(), vol 4005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11776420_42
DOI: https://doi.org/10.1007/11776420_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35294-5
Online ISBN: 978-3-540-35296-9