Posterior sampling for Monte Carlo planning under uncertainty


Abstract

Monte Carlo tree search (MCTS) has recently drawn great interest in the domain of planning and learning under uncertainty. One of its fundamental challenges is the trade-off between exploration and exploitation. To address this problem, we propose to balance exploration and exploitation via posterior sampling, in the contexts of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Specifically, we treat the cumulative reward returned by taking an action from a node in the MCTS search tree as a random variable following an unknown distribution. We parametrize this distribution by introducing the necessary hidden parameters and infer their posterior distribution in a Bayesian way. We then expand a node in the search tree by using Thompson sampling to select an action according to its posterior probability of being optimal. Following this idea, we develop the Dirichlet-NormalGamma based Monte Carlo tree search (DNG-MCTS) and Dirichlet-Dirichlet-NormalGamma based partially observable Monte Carlo planning (D2NG-POMCP) algorithms for Monte Carlo planning in MDPs and POMDPs, respectively. Experimental results show that the proposed algorithms outperform state-of-the-art methods, achieving better values on several benchmark problems.
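To make the action-selection step concrete, the following is a minimal, illustrative sketch of Thompson sampling over NormalGamma posteriors, the mechanism underlying the MDP variant (DNG-MCTS). It is not the authors' implementation: the class and function names, prior hyperparameters, and the toy bandit-style usage at the end are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code) of Thompson sampling with
# NormalGamma posteriors over per-action mean returns, as used for
# action selection in DNG-MCTS. All names and priors are illustrative.
import numpy as np

class NormalGamma:
    """Conjugate posterior over the mean and precision of a Gaussian return."""
    def __init__(self, mu0=0.0, lam=1.0, alpha=1.0, beta=1.0):
        self.mu0, self.lam, self.alpha, self.beta = mu0, lam, alpha, beta

    def update(self, x):
        """Standard conjugate update after observing one sampled return x."""
        self.beta += self.lam * (x - self.mu0) ** 2 / (2.0 * (self.lam + 1.0))
        self.mu0 = (self.lam * self.mu0 + x) / (self.lam + 1.0)
        self.lam += 1.0
        self.alpha += 0.5

    def sample_mean(self, rng):
        """Draw a plausible mean return from the posterior."""
        tau = rng.gamma(self.alpha, 1.0 / self.beta)        # precision
        return rng.normal(self.mu0, 1.0 / np.sqrt(self.lam * tau))

def thompson_select(posteriors, rng):
    """Pick the action whose sampled mean return is largest."""
    return int(np.argmax([p.sample_mean(rng) for p in posteriors]))

# Toy usage: three actions with true mean returns 0.2, 0.5 and 0.8.
rng = np.random.default_rng(0)
posteriors = [NormalGamma() for _ in range(3)]
for _ in range(500):
    a = thompson_select(posteriors, rng)
    reward = rng.normal([0.2, 0.5, 0.8][a], 1.0)            # simulated return
    posteriors[a].update(reward)
print([round(p.mu0, 2) for p in posteriors])                # selection concentrates on action 2
```

In DNG-MCTS, one such posterior would be maintained per node-action pair and updated with the Monte Carlo return of each simulation passing through it; the sketch above shows only the sampling and conjugate-update mechanics under those assumptions.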


Notes

  1. More details about the CTP problem are given in Section 2.1.

  2. The advantages of using a simulator are discussed in Section 6.1.

  3. https://en.wikipedia.org/wiki/Canadian_traveller_problem

  4. https://code.google.com/p/mdp-engine/

  5. http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications.html


Acknowledgements

Feng Wu was supported in part by the National Natural Science Foundation of China under grant No. 61603368, the Youth Innovation Promotion Association of CAS (No. 2015373), and the Natural Science Foundation of Anhui Province under grant No. 1608085QF134. Aijun Bai was supported in part by the National Research Foundation for the Doctoral Program of China under grant 20133402110026, the National Hi-Tech Project of China under grant 2008AA01Z150, and the Natural Science Foundation of China under grants 60745002 and 61175057. We are grateful to the reviewers for their constructive comments and suggestions.

Author information


Corresponding author

Correspondence to Feng Wu.


About this article


Cite this article

Bai, A., Wu, F. & Chen, X. Posterior sampling for Monte Carlo planning under uncertainty. Appl Intell 48, 4998–5018 (2018). https://doi.org/10.1007/s10489-018-1248-5

