
Approximate planning for Bayesian hierarchical reinforcement learning


Abstract

In this paper, we propose to use hierarchical action decomposition to make Bayesian model-based reinforcement learning more efficient and feasible for larger problems. We formulate Bayesian hierarchical reinforcement learning as a partially observable semi-Markov decision process (POSMDP). The main POSMDP task is partitioned into a hierarchy of POSMDP subtasks; each subtask consists either of primitive actions only or of calls to the policies of other subtasks, since the policies of lower-level subtasks serve as macro actions in higher-level subtasks. This hierarchical decomposition is solved bottom-up: lower-level subtasks are solved first, then higher-level ones. Because each formulated POSMDP has a continuous state space, we sample from a prior belief to build an approximate model for each subtask and then solve it with the recently introduced Monte Carlo Value Iteration with Macro-Actions solver. We name this method Monte Carlo Bayesian Hierarchical Reinforcement Learning. Simulation results show that our algorithm, by exploiting the action hierarchy, significantly outperforms flat Bayesian reinforcement learning in terms of both reward and, especially, solving time, which improves by at least an order of magnitude.
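To illustrate the bottom-up solving order described in the abstract, the following is a minimal Python sketch, not the authors' implementation: the Subtask structure, the sample_model callable, and the solve_posmdp callable (standing in for the Monte Carlo Value Iteration with Macro-Actions solver) are hypothetical names introduced here purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Subtask:
    """One node in the POSMDP task hierarchy (hypothetical structure)."""
    name: str
    primitive_actions: List[str] = field(default_factory=list)
    children: List["Subtask"] = field(default_factory=list)

def solve_hierarchy(root: Subtask,
                    sample_model: Callable[[], object],
                    solve_posmdp: Callable[[Subtask, list, list], object],
                    num_model_samples: int = 100) -> dict:
    """Solve every subtask bottom-up; lower-level policies become macro actions."""
    policies = {}

    def solve(subtask: Subtask):
        # Recurse first: the policies of lower-level subtasks are the
        # macro actions available to this higher-level subtask.
        macro_actions = [solve(child) for child in subtask.children]

        # The belief over the unknown dynamics makes the state space
        # continuous, so approximate the subtask's POSMDP with models
        # sampled from the prior belief.
        models = [sample_model() for _ in range(num_model_samples)]

        # solve_posmdp stands in for the MCVI-with-macro-actions solver.
        policy = solve_posmdp(subtask, models,
                              subtask.primitive_actions + macro_actions)
        policies[subtask.name] = policy
        return policy

    solve(root)
    return policies
```

The key design point the sketch captures is the solving order: each subtask's policy must exist before its parent is solved, since the parent treats those policies as temporally extended (macro) actions.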



Acknowledgments

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (2010-0012609).

Author information

Correspondence to Ngo Anh Vien or TaeChoong Chung.


Cite this article

Vien, N., Ngo, H., Lee, S. et al. Approximate planning for bayesian hierarchical reinforcement learning. Appl Intell 41, 808–819 (2014). https://doi.org/10.1007/s10489-014-0565-6
