Abstract
Partially observable Markov decision processes (POMDPs) provide a mathematical framework for agent planning in stochastic, partially observable environments. The classic Bayesian optimal solution can be obtained by transforming the problem into a Markov decision process (MDP) over belief states. However, because the belief-state space is continuous and high-dimensional, the resulting problem is highly intractable. Many practical heuristic methods have been proposed, but most require a complete POMDP model of the environment, which is not always available. This article introduces a modified memory-based reinforcement learning algorithm, called modified U-Tree, that is capable of learning from raw sensor experiences with minimal prior knowledge. The article describes an enhancement of the original U-Tree's state-generation process that makes the generated model more compact, and also proposes a modification of the statistical test for reward estimation, which allows the algorithm to be benchmarked against traditional model-based algorithms on a set of well-known POMDP problems.
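The belief-state transformation mentioned above can be illustrated with a minimal sketch. The function below is not part of the article's algorithm; it is a standard Bayesian belief update, with the indexing convention for the (hypothetical) transition table `T` and observation table `O` chosen purely for illustration.

```python
def belief_update(belief, action, obs, T, O):
    """One Bayesian belief-state update for a discrete POMDP.

    belief : list of P(s) over states
    T[s][a][s2] : P(s2 | s, a)   (assumed layout, for illustration)
    O[s2][a][o] : P(o | s2, a)   (assumed layout, for illustration)
    """
    n = len(belief)
    # Unnormalized posterior: b'(s2) = O(o|s2,a) * sum_s T(s2|s,a) * b(s)
    new_b = [O[s2][action][obs] * sum(T[s][action][s2] * belief[s]
                                      for s in range(n))
             for s2 in range(n)]
    z = sum(new_b)  # normalizing constant = P(o | b, a)
    return [p / z for p in new_b] if z > 0 else belief
```

Running this update after every action/observation pair turns the POMDP into an MDP whose (continuous) state is the belief vector, which is exactly why exact solutions scale poorly with the number of underlying states.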
Zheng, L., Cho, SY. A Modified Memory-Based Reinforcement Learning Method for Solving POMDP Problems. Neural Process Lett 33, 187–200 (2011). https://doi.org/10.1007/s11063-011-9172-2