Posterior sampling for Monte Carlo planning under uncertainty


Abstract

Monte Carlo tree search (MCTS) has recently drawn great interest in the domain of planning and learning under uncertainty. One of its fundamental challenges is the trade-off between exploration and exploitation. To address this problem, we propose to balance exploration and exploitation via posterior sampling, in the contexts of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Specifically, we treat the cumulative reward returned by taking an action from a node in the MCTS search tree as a random variable following an unknown distribution. We parametrize this distribution by introducing the necessary hidden parameters and infer their posterior distribution in a Bayesian way. We then expand a node in the search tree by using Thompson sampling to select an action according to its posterior probability of being optimal. Following this idea, we develop the Dirichlet-NormalGamma based Monte Carlo tree search (DNG-MCTS) and Dirichlet-Dirichlet-NormalGamma based partially observable Monte Carlo planning (D2NG-POMCP) algorithms for Monte Carlo planning in MDPs and POMDPs, respectively. Experimental results show that the proposed algorithms outperform state-of-the-art methods, achieving better values on several benchmark problems.
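To make the action-selection step concrete, the following is a minimal, illustrative sketch of Thompson sampling over NormalGamma posteriors, the mechanism underlying the MDP variant (DNG-MCTS). It is not the authors' implementation: the class and function names, prior hyperparameters, and the toy bandit-style usage at the end are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code) of Thompson sampling with
# NormalGamma posteriors over per-action mean returns, as used for
# action selection in DNG-MCTS. All names and priors are illustrative.
import numpy as np

class NormalGamma:
    """Conjugate posterior over the mean and precision of a Gaussian return."""
    def __init__(self, mu0=0.0, lam=1.0, alpha=1.0, beta=1.0):
        self.mu0, self.lam, self.alpha, self.beta = mu0, lam, alpha, beta

    def update(self, x):
        """Standard conjugate update after observing one sampled return x."""
        self.beta += self.lam * (x - self.mu0) ** 2 / (2.0 * (self.lam + 1.0))
        self.mu0 = (self.lam * self.mu0 + x) / (self.lam + 1.0)
        self.lam += 1.0
        self.alpha += 0.5

    def sample_mean(self, rng):
        """Draw a plausible mean return from the posterior."""
        tau = rng.gamma(self.alpha, 1.0 / self.beta)        # precision
        return rng.normal(self.mu0, 1.0 / np.sqrt(self.lam * tau))

def thompson_select(posteriors, rng):
    """Pick the action whose sampled mean return is largest."""
    return int(np.argmax([p.sample_mean(rng) for p in posteriors]))

# Toy usage: three actions with true mean returns 0.2, 0.5 and 0.8.
rng = np.random.default_rng(0)
posteriors = [NormalGamma() for _ in range(3)]
for _ in range(500):
    a = thompson_select(posteriors, rng)
    reward = rng.normal([0.2, 0.5, 0.8][a], 1.0)            # simulated return
    posteriors[a].update(reward)
print([round(p.mu0, 2) for p in posteriors])                # selection concentrates on action 2
```

In DNG-MCTS, one such posterior would be maintained per node-action pair and updated with the Monte Carlo return of each simulation passing through it; the sketch above shows only the sampling and conjugate-update mechanics under those assumptions.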


Notes

  1. More details about the CTP problem are given in Section 2.1.

  2. The advantages of using a simulator are discussed in Section 6.1.

  3. https://en.wikipedia.org/wiki/Canadian_traveller_problem

  4. https://code.google.com/p/mdp-engine/

  5. http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications.html


Acknowledgements

Feng Wu was supported in part by the National Natural Science Foundation of China under grant No. 61603368, the Youth Innovation Promotion Association of CAS (No. 2015373), and the Natural Science Foundation of Anhui Province under grant No. 1608085QF134. Aijun Bai was supported in part by the National Research Foundation for the Doctoral Program of China under grant 20133402110026, the National Hi-Tech Project of China under grant 2008AA01Z150, and the Natural Science Foundation of China under grants 60745002 and 61175057. We are grateful to the reviewers for their constructive comments and suggestions.

Author information


Corresponding author

Correspondence to Feng Wu.


About this article


Cite this article

Bai, A., Wu, F. & Chen, X. Posterior sampling for Monte Carlo planning under uncertainty. Appl Intell 48, 4998–5018 (2018). https://doi.org/10.1007/s10489-018-1248-5

