Abstract
One of the major difficulties in applying Q-learning to real-world domains is the sharp increase in the number of learning steps required to converge towards an optimal policy as the size of the state space increases. In this paper we propose PLANQ-learning, a method that couples a Q-learner with a STRIPS planner. The planner shapes the reward function and thereby guides the Q-learner quickly to the optimal policy. We demonstrate empirically that this combination of high-level reasoning and low-level learning scales significantly better with state-space size than both standard Q-learning and hierarchical Q-learning methods.
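The mechanism the abstract describes (a symbolic planner whose plan shapes the Q-learner's reward) can be illustrated with a short sketch. The Python below is illustrative only: the environment interface (`reset`, `step`, `actions`), the subgoal predicates, and the shaping bonus are assumptions made for exposition, not the paper's actual PLANQ-learning implementation.

```python
# A minimal sketch of plan-shaped Q-learning, in the spirit of PLANQ-learning.
# The environment interface and reward values here are illustrative assumptions.
import random
from collections import defaultdict

def planq_sketch(env, plan_steps, episodes=500,
                 alpha=0.1, gamma=0.9, epsilon=0.1, shaping_bonus=1.0):
    """Tabular Q-learning whose reward is shaped by a symbolic plan.

    `plan_steps` is an ordered list of predicates (callables on states),
    e.g. the subgoals a STRIPS planner produced for the task. Whenever
    the agent first satisfies the next unmet subgoal, it receives an
    extra shaping reward on top of the environment reward.
    States must be hashable, since Q is keyed by (state, action) pairs.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        progress = 0                      # index of the next unmet subgoal
        done = False
        while not done:
            # Epsilon-greedy action selection over the environment's actions.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Plan-based shaping: bonus for achieving the next subgoal.
            if progress < len(plan_steps) and plan_steps[progress](next_state):
                reward += shaping_bonus
                progress += 1
            # Standard one-step Q-learning update.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (
                reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The shaping bonus densifies the otherwise sparse reward signal, which is what allows the learner to converge in far fewer steps as the state space grows; the planner itself never dictates primitive actions.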
References
Barto, A., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379 (2003)
Dietterich, T.G.: Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303 (2000)
Watkins, C.J.C.H.: Learning from Delayed Rewards. PhD thesis, Cambridge University, U.K. (1989)
Fikes, R., Nilsson, N.: STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence 2, 189–208 (1971)
Blum, A.L., Furst, M.L.: Fast planning through planning graph analysis. Artificial Intelligence 90, 281–300 (1997)
Hoffmann, J.: A heuristic for domain independent planning and its use in an enforced hill-climbing algorithm. In: Proceedings of the 12th International Symposium on Methodologies for Intelligent Systems, pp. 216–227 (2000)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
Ghallab, M., Howe, A., Knoblock, C., McDermott, D., Ram, A., Veloso, M., Weld, D., Wilkins, D.: PDDL—the planning domain definition language. Technical Report CVC TR-98-003/DCS TR-1165, Yale Center for Computational Vision and Control (1998)
Ryan, M.: Using abstract models of behaviours to automatically generate reinforcement learning hierarchies. In: Proceedings of the 19th International Conference on Machine Learning (2002)
Boutilier, C., Brafman, R.I., Geib, C.: Prioritized goal decomposition of Markov decision processes: Towards a synthesis of classical and decision theoretic planning. In: International Joint Conference on Artificial Intelligence (1997)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Grounds, M., Kudenko, D. (2008). Combining Reinforcement Learning with Symbolic Planning. In: Tuyls, K., Nowe, A., Guessoum, Z., Kudenko, D. (eds) Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning. ALAMAS 2005-2007. Lecture Notes in Computer Science (LNAI), vol 4865. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77949-0_6
DOI: https://doi.org/10.1007/978-3-540-77949-0_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77947-6
Online ISBN: 978-3-540-77949-0
eBook Packages: Computer Science (R0)