Abstract
This work considers the problems of learning and planning in Markovian environments with constant observation and reward delays. We provide a hardness result for the general planning problem and positive results for several special cases with deterministic or otherwise constrained dynamics. We present an algorithm, Model Based Simulation, for planning in such environments, and use model-based reinforcement learning to extend this approach to the learning setting in both finite and continuous environments. Empirical comparisons show that this algorithm significantly outperforms alternative approaches to decision making in delayed-observation environments.
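The core planning idea behind Model Based Simulation can be illustrated with a minimal sketch: under a constant observation delay of k steps, the agent's latest observation is k steps stale, so it rolls that state forward through the k actions already in flight using its (here assumed deterministic) transition model, then acts greedily in the predicted current state. All names below (`model`, `q_values`, `choose_action`, the toy chain MDP) are illustrative assumptions, not the paper's API.

```python
from collections import deque

def choose_action(observed_state, pending_actions, model, q_values):
    """Simulate the delayed observation forward through the pending
    actions, then act greedily in the predicted current state."""
    state = observed_state
    for a in pending_actions:        # actions taken but not yet observed
        state = model(state, a)      # assumed deterministic model
    return max(q_values[state], key=q_values[state].get)

# Toy deterministic chain MDP: states 0..3, actions move left/right.
def model(s, a):
    return max(0, min(3, s + (1 if a == "right" else -1)))

# Hand-set Q-values that always prefer moving right, toward state 3.
q_values = {s: {"left": float(s - 1), "right": float(s + 1)} for s in range(4)}

pending = deque(["right", "right"])  # delay k = 2: two actions in flight
a = choose_action(0, pending, model, q_values)
print(a)  # the predicted current state is 2, where "right" is greedy
```

With stochastic dynamics the forward simulation would instead propagate a distribution over states (or the most likely state), which is where the paper's hardness and special-case results come into play.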
Walsh, T.J., Nouri, A., Li, L. et al. Learning and planning in environments with delayed feedback. Auton Agent Multi-Agent Syst 18, 83–105 (2009). https://doi.org/10.1007/s10458-008-9056-7