A Modified Memory-Based Reinforcement Learning Method for Solving POMDP Problems

Neural Processing Letters

Abstract

Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning in stochastic, partially observable environments. The classic Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) over belief states. However, because the belief-state space is continuous and multi-dimensional, the problem is highly intractable. Many practical heuristic-based methods have been proposed, but most of them require a complete POMDP model of the environment, which is not always available in practice. This article introduces a modified memory-based reinforcement learning algorithm, called modified U-Tree, that is capable of learning from raw sensor experiences with minimal prior knowledge. The article describes an enhancement of the original U-Tree's state-generation process that makes the learned model more compact, and also proposes a modification of the statistical test used for reward estimation, which allows the algorithm to be benchmarked against traditional model-based algorithms on a set of well-known POMDP problems.
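As background for the belief-state transformation mentioned above, the sketch below shows the standard Bayesian belief update for a discrete POMDP. It is a minimal illustration assuming NumPy-style arrays for the transition and observation models; the function name and array layout are illustrative assumptions, not code from the paper.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Standard Bayesian belief update: b'(s') ∝ O(o | s', a) * Σ_s T(s' | s, a) b(s).

    b : (|S|,)          current belief over hidden states
    a : int             action taken
    o : int             observation received
    T : (|A|, |S|, |S|) transition model, T[a, s, s'] = Pr(s' | s, a)
    O : (|A|, |S|, |O|) observation model, O[a, s', o] = Pr(o | s', a)
    """
    predicted = b @ T[a]                  # Σ_s b(s) Pr(s' | s, a)
    unnormalised = predicted * O[a, :, o]
    norm = unnormalised.sum()             # Pr(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return unnormalised / norm
```

Treating these belief vectors as the states of an equivalent MDP is what yields the continuous, multi-dimensional state space that makes exact solution intractable, as noted in the abstract.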


Author information


Corresponding author

Correspondence to Siu-Yeung Cho.


About this article

Cite this article

Zheng, L., Cho, SY. A Modified Memory-Based Reinforcement Learning Method for Solving POMDP Problems. Neural Process Lett 33, 187–200 (2011). https://doi.org/10.1007/s11063-011-9172-2

