
Monte-Carlo tree search for Bayesian reinforcement learning


Abstract

Bayesian model-based reinforcement learning can be formulated as a partially observable Markov decision process (POMDP), providing a principled framework for optimally balancing exploitation and exploration; the learning problem can then be handed to a POMDP solver. If the prior distribution over the environment's dynamics is a product of Dirichlet distributions, the POMDP's optimal value function can be represented by a set of multivariate polynomials. Unfortunately, the size of these polynomials grows exponentially with the problem horizon. In this paper, we examine the use of an online Monte-Carlo tree search (MCTS) algorithm for large POMDPs to solve the Bayesian reinforcement learning problem online, and show that such an algorithm successfully finds a near-optimal policy. In addition, we examine a parameter tying method that keeps the model search space small, and propose a nested mixture of tied models to increase the robustness of the method when the prior information does not allow the structure of the tied models to be specified exactly. Experiments show that the proposed methods substantially improve the scalability of current Bayesian reinforcement learning methods.
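
To make the planning step concrete, the sketch below shows a minimal Bayes-adaptive MCTS planner in Python: the belief over the unknown transition model is a product of Dirichlet distributions represented by counts, and actions are chosen by UCT-style search over the belief-augmented state. This is only an illustrative sketch, not the authors' implementation; the class and method names (BayesAdaptiveMCTS, sample_next_state, observe), the uniform Dirichlet prior, and the assumption of a known reward function are all assumptions made for the example.

```python
# Minimal sketch of Bayes-adaptive Monte-Carlo tree search (illustrative only).
import math
import random
from collections import defaultdict

class BayesAdaptiveMCTS:
    def __init__(self, n_states, n_actions, reward_fn, gamma=0.95,
                 ucb_c=1.4, depth=15, n_simulations=500):
        self.nS, self.nA = n_states, n_actions
        self.reward_fn = reward_fn          # reward_fn(s, a) -> float, assumed known
        self.gamma, self.ucb_c = gamma, ucb_c
        self.depth, self.n_sims = depth, n_simulations
        # Dirichlet belief: one count vector per (s, a); prior counts = 1 (uniform).
        self.counts = defaultdict(lambda: [1.0] * n_states)
        # UCT statistics, keyed by (simulated history, state, action).
        self.N = defaultdict(int)           # visit counts
        self.Q = defaultdict(float)         # mean returns

    def sample_next_state(self, s, a):
        """Sample s' from the Dirichlet posterior predictive for (s, a)."""
        c = self.counts[(s, a)]
        r, acc = random.random() * sum(c), 0.0
        for s2, w in enumerate(c):
            acc += w
            if r <= acc:
                return s2
        return self.nS - 1

    def search(self, s):
        """Run simulations from state s and return the greedy root action."""
        for _ in range(self.n_sims):
            self.simulate(s, history=(), d=self.depth)
        return max(range(self.nA), key=lambda a: self.Q[((), s, a)])

    def simulate(self, s, history, d):
        if d == 0:
            return 0.0
        # UCB1 action selection at the belief-augmented node.
        n_node = sum(self.N[(history, s, a)] for a in range(self.nA)) + 1
        def ucb(a):
            n = self.N[(history, s, a)]
            if n == 0:
                return float('inf')
            return self.Q[(history, s, a)] + self.ucb_c * math.sqrt(math.log(n_node) / n)
        a = max(range(self.nA), key=ucb)
        s2 = self.sample_next_state(s, a)
        r = self.reward_fn(s, a)
        # Update the belief along the simulated trajectory, then undo afterwards:
        # counts change permanently only for real environment transitions.
        self.counts[(s, a)][s2] += 1
        ret = r + self.gamma * self.simulate(s2, history + ((s, a, s2),), d - 1)
        self.counts[(s, a)][s2] -= 1
        # Incremental UCT backup.
        self.N[(history, s, a)] += 1
        self.Q[(history, s, a)] += (ret - self.Q[(history, s, a)]) / self.N[(history, s, a)]
        return ret

    def observe(self, s, a, s2):
        """Update the Dirichlet counts after a real environment transition."""
        self.counts[(s, a)][s2] += 1
```

In use, each real time step would call search(s) to select an action, execute it in the environment, and then call observe(s, a, s2) so that the Dirichlet counts track the true posterior over the dynamics.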



Acknowledgements

This work was supported by the Collaborative Center of Applied Research on Service Robotics (ZAFH Servicerobotik, http://www.zafh-servicerobotik.de) and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (2010-0012609).

Author information

Corresponding author

Correspondence to Ngo Anh Vien.


Cite this article

Vien, N.A., Ertel, W., Dang, VH. et al. Monte-Carlo tree search for Bayesian reinforcement learning. Appl Intell 39, 345–353 (2013). https://doi.org/10.1007/s10489-012-0416-2
