Abstract
Reinforcement Learning (RL) is an approach for solving complex multi-stage decision problems that fall under the general framework of Markov Decision Processes (MDPs), possibly with unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis functions is important for the success of the algorithm. In the present paper we examine methods for adapting the basis functions during the learning process, in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis functions while simultaneously adapting their (non-linear) parameters. We present two algorithms for this problem: the first uses a gradient-based approach, and the second applies the Cross-Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
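To make the setting concrete, the sketch below (illustrative only, not the authors' implementation) adapts the parameters of Gaussian radial basis functions by the Cross-Entropy method: each candidate parameter vector is scored by the empirical Bellman error of a least-squares weight fit on transitions sampled under a fixed policy. The chain MDP, the choice of Gaussian RBFs with a shared width, the plain least-squares Bellman-residual weight fit (the paper's own inner loop may use a temporal-difference variant), and all population sizes are assumptions made for this example.

```python
import numpy as np

# Illustrative problem: a 1-D chain MDP under a fixed policy that drifts right;
# reward 1 is received on entering the last state.
rng = np.random.default_rng(0)
N_STATES, GAMMA = 20, 0.95

def sample_transitions(n=2000):
    """Sample (s, r, s') tuples from the chain under the fixed policy."""
    S = rng.integers(0, N_STATES, size=n)
    step = rng.choice([-1, 1], size=n, p=[0.1, 0.9])
    S_next = np.clip(S + step, 0, N_STATES - 1)
    R = (S_next == N_STATES - 1).astype(float)
    return S, R, S_next

def rbf_features(S, centers, width):
    """Gaussian radial basis functions; centers and width are the adaptable parameters."""
    d = S[:, None] - centers[None, :]
    return np.exp(-0.5 * (d / width) ** 2)

def bellman_error(params, S, R, S_next):
    """Fit linear weights by least squares, then return the empirical Bellman error."""
    centers, width = params[:-1], abs(params[-1]) + 1e-3
    Phi, Phi_next = rbf_features(S, centers, width), rbf_features(S_next, centers, width)
    A = Phi - GAMMA * Phi_next
    w, *_ = np.linalg.lstsq(A, R, rcond=None)   # minimizes ||Phi w - (R + gamma Phi' w)||
    return np.mean((Phi @ w - (R + GAMMA * Phi_next @ w)) ** 2)

# Cross-Entropy search over the basis-function parameters (centers plus a shared width).
K = 5
mu = np.concatenate([np.linspace(0, N_STATES - 1, K), [3.0]])
sigma = np.full(K + 1, 5.0)
S, R, S_next = sample_transitions()

for it in range(30):
    pop = rng.normal(mu, sigma, size=(50, K + 1))             # sample candidate parameters
    scores = np.array([bellman_error(p, S, R, S_next) for p in pop])
    elite = pop[np.argsort(scores)[:10]]                       # keep the lowest-error candidates
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6   # refit the sampling distribution
    print(f"iter {it:2d}  best Bellman error {scores.min():.5f}")
```

The gradient-based alternative described in the abstract would instead differentiate the same Bellman-error criterion with respect to the basis parameters and update them by (stochastic) gradient descent, rather than by resampling a population.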
Additional information
This research was partially supported by the Fund for Promotion of Research at the Technion. The work of S.M. was partially supported by the National Science Foundation under grant ECS-0312921.
Cite this article
Menache, I., Mannor, S. & Shimkin, N. Basis Function Adaptation in Temporal Difference Reinforcement Learning. Ann Oper Res 134, 215–238 (2005). https://doi.org/10.1007/s10479-005-5732-z