Abstract
Reinforcement Learning (RL) is an approach for solving complex multi-stage decision problems that fall under the general framework of Markov Decision Processes (MDPs), possibly with unknown parameters. Function approximation is essential for problems with a large state space, as it facilitates compact representation and enables generalization. Linear approximation architectures (where the adjustable parameters are the weights of pre-fixed basis functions) have recently gained prominence due to efficient algorithms and convergence guarantees. Nonetheless, an appropriate choice of basis functions is important for the success of the algorithm. In the present paper we examine methods for adapting the basis functions during the learning process, in the context of evaluating the value function under a fixed control policy. Using the Bellman approximation error as an optimization criterion, we optimize the weights of the basis functions while simultaneously adapting their (non-linear) parameters. We present two algorithms for this problem: the first uses a gradient-based approach, and the second applies the Cross-Entropy method. The performance of the proposed algorithms is evaluated and compared in simulations.
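To make the setting concrete, the sketch below (illustrative only, not the authors' implementation) adapts the parameters of Gaussian radial basis functions by the Cross-Entropy method: each candidate parameter vector is scored by the empirical Bellman error of a least-squares weight fit on transitions sampled under a fixed policy. The chain MDP, the choice of Gaussian RBFs with a shared width, the plain least-squares Bellman-residual weight fit (the paper's own inner loop may use a temporal-difference variant), and all population sizes are assumptions made for this example.

```python
import numpy as np

# Illustrative problem: a 1-D chain MDP under a fixed policy that drifts right;
# reward 1 is received on entering the last state.
rng = np.random.default_rng(0)
N_STATES, GAMMA = 20, 0.95

def sample_transitions(n=2000):
    """Sample (s, r, s') tuples from the chain under the fixed policy."""
    S = rng.integers(0, N_STATES, size=n)
    step = rng.choice([-1, 1], size=n, p=[0.1, 0.9])
    S_next = np.clip(S + step, 0, N_STATES - 1)
    R = (S_next == N_STATES - 1).astype(float)
    return S, R, S_next

def rbf_features(S, centers, width):
    """Gaussian radial basis functions; centers and width are the adaptable parameters."""
    d = S[:, None] - centers[None, :]
    return np.exp(-0.5 * (d / width) ** 2)

def bellman_error(params, S, R, S_next):
    """Fit linear weights by least squares, then return the empirical Bellman error."""
    centers, width = params[:-1], abs(params[-1]) + 1e-3
    Phi, Phi_next = rbf_features(S, centers, width), rbf_features(S_next, centers, width)
    A = Phi - GAMMA * Phi_next
    w, *_ = np.linalg.lstsq(A, R, rcond=None)   # minimizes ||Phi w - (R + gamma Phi' w)||
    return np.mean((Phi @ w - (R + GAMMA * Phi_next @ w)) ** 2)

# Cross-Entropy search over the basis-function parameters (centers plus a shared width).
K = 5
mu = np.concatenate([np.linspace(0, N_STATES - 1, K), [3.0]])
sigma = np.full(K + 1, 5.0)
S, R, S_next = sample_transitions()

for it in range(30):
    pop = rng.normal(mu, sigma, size=(50, K + 1))             # sample candidate parameters
    scores = np.array([bellman_error(p, S, R, S_next) for p in pop])
    elite = pop[np.argsort(scores)[:10]]                       # keep the lowest-error candidates
    mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6   # refit the sampling distribution
    print(f"iter {it:2d}  best Bellman error {scores.min():.5f}")
```

The gradient-based alternative described in the abstract would instead differentiate the same Bellman-error criterion with respect to the basis parameters and update them by (stochastic) gradient descent, rather than by resampling a population.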
Additional information
This research was partially supported by the Fund for Promotion of Research at the Technion. The work of S.M. was partially supported by the National Science Foundation under grant ECS-0312921.
Cite this article
Menache, I., Mannor, S. & Shimkin, N. Basis Function Adaptation in Temporal Difference Reinforcement Learning. Ann Oper Res 134, 215–238 (2005). https://doi.org/10.1007/s10479-005-5732-z