
Learning and planning in environments with delayed feedback


Abstract

This work considers the problems of learning and planning in Markovian environments with constant observation and reward delays. We provide a hardness result for the general planning problem and positive results for several special cases with deterministic or otherwise constrained dynamics. We present an algorithm, Model Based Simulation, for planning in such environments and use model-based reinforcement learning to extend this approach to the learning setting in both finite and continuous environments. Empirical comparisons show that this algorithm holds significant advantages over existing approaches for decision making in delayed-observation environments.
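The core idea behind planning with a constant observation delay can be illustrated with a minimal sketch. With a delay of k steps, the agent always knows the last observed state and the k actions it has issued since that observation; rolling that state forward through a (learned or given) transition model yields an estimate of the current state, against which the agent can act greedily in the undelayed MDP. The sketch below assumes deterministic dynamics, and all names (mbs_action, transition_model, q_values) are hypothetical; it illustrates the simulation idea described in the abstract rather than reproducing the paper's exact algorithm.

```python
from collections import deque

def mbs_action(last_obs_state, action_queue, transition_model, q_values):
    """Model Based Simulation (sketch): roll the delayed observation
    forward through the queued actions to estimate the current state,
    then act greedily in the underlying (undelayed) MDP.

    Assumptions (illustrative only):
      - transition_model(s, a) returns the predicted next state
        (deterministic dynamics, or the most likely successor).
      - q_values[s][a] holds Q-values computed for the undelayed MDP.
    """
    s = last_obs_state
    for a in action_queue:          # the k actions taken since s was observed
        s = transition_model(s, a)  # simulate one step forward
    # act greedily with respect to the estimated current state
    return max(q_values[s], key=q_values[s].get)


# Usage sketch for a delay of k = 2 in a toy 5-state chain (all names hypothetical).
model = lambda s, a: min(max(s + (1 if a == "right" else -1), 0), 4)
Q = {s: {"left": -s, "right": s} for s in range(5)}
queued = deque(["right", "right"], maxlen=2)
print(mbs_action(0, queued, model, Q))  # -> "right"
```

Under deterministic dynamics the forward simulation collapses to a single point estimate of the current state; with stochastic dynamics the same loop would instead propagate a distribution (or most-likely state) through the model.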



Author information

Correspondence to Thomas J. Walsh.


About this article

Cite this article

Walsh, T.J., Nouri, A., Li, L. et al. Learning and planning in environments with delayed feedback. Auton Agent Multi-Agent Syst 18, 83–105 (2009). https://doi.org/10.1007/s10458-008-9056-7

