Abstract
QMDP-net is a recurrent neural network architecture that combines model-free learning with model-based planning for acting under partial observability. It represents a policy by embedding a partially observable Markov decision process (POMDP) model together with the QMDP algorithm, which approximately solves the model by value iteration. However, because the value iteration in QMDP sweeps the entire state space on every iteration, it can suffer from the "curse of dimensionality". Moreover, QMDP-based policies never take actions purely to gather information, which can lead to poor behavior in domains where information gathering is essential. To address these two issues, this paper introduces two deep recurrent policy networks built on the plain QMDP-net: asynchronous QMDP-net and ReplicatedQ-net. The former incorporates asynchronous updates into the value iteration of QMDP, allowing it to learn a smaller abstract state-space representation for planning. The latter partially replaces QMDP with the replicated Q-learning algorithm so that the policy can take information-gathering actions. Experimental results show that both proposed networks outperform the plain QMDP-net on simulated robotic tasks.
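The paper embeds these computations as differentiable layers inside a recurrent network; as plain algorithms, the three building blocks the abstract names can be illustrated with a minimal NumPy sketch. The function names, the prioritized-by-Bellman-error state selection in the asynchronous variant, and the tensor layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def qmdp_q_values(T, R, gamma=0.95, n_iter=100):
    """Synchronous value iteration on the underlying MDP.
    T: transition tensor T[a, s, s']; R: reward matrix R[s, a]."""
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        V = Q.max(axis=1)                      # V(s) = max_a Q(s, a)
        for a in range(n_actions):
            Q[:, a] = R[:, a] + gamma * T[a] @ V
    return Q

def async_qmdp_q_values(T, R, gamma=0.95, max_backups=10000):
    """Asynchronous variant: back up one state at a time, largest
    Bellman error first, so computation concentrates on a small
    subset of states rather than sweeping the whole space."""
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_backups):
        V = Q.max(axis=1)
        backup = np.stack([R[:, a] + gamma * T[a] @ V
                           for a in range(n_actions)], axis=1)
        errors = np.abs(backup - Q).max(axis=1)
        s = int(errors.argmax())               # prioritized-sweeping-style pick
        if errors[s] < 1e-6:
            break
        Q[s] = backup[s]
    return Q

def qmdp_action(Q, belief):
    """QMDP scores each action by its belief-weighted MDP Q-value."""
    return int((belief @ Q).argmax())

def replicated_q_update(Q, belief, a, r, next_belief, alpha=0.1, gamma=0.95):
    """One replicated Q-learning step (Littman et al., 1995): every
    state's Q-vector moves toward the same scalar target, weighted
    by the current belief mass on that state."""
    target = r + gamma * (next_belief @ Q).max()
    Q[:, a] += alpha * belief * (target - Q[:, a])
    return Q
```

Picking the state with the largest Bellman error is one common asynchronous schedule, in the spirit of prioritized sweeping; the network versions proposed in the paper learn where to focus updates as part of training rather than hard-coding such a rule.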
Acknowledgments
This work is in part supported by the National Natural Science Foundation of China under Grant Nos. 61876119 and 61502323, and the Natural Science Foundation of Jiangsu under Grant No. BK20181432.
Cite this paper
Chen, Z., Zhang, Z. (2019). Deep Recurrent Policy Networks for Planning Under Partial Observability. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds.) Artificial Neural Networks and Machine Learning – ICANN 2019: Theoretical Neural Computation. LNCS, vol. 11727. Springer, Cham. https://doi.org/10.1007/978-3-030-30487-4_46