Abstract
Reinforcement learning has long faced the “exploration-exploitation problem”: an agent must decide whether to explore in search of a better action, which may not exist, or to exploit the reward of the best action found so far. In this article, we propose an off-policy reinforcement learning method based on natural policy gradient learning as a solution to the exploration-exploitation problem. In our method, the policy gradient is estimated from a sequence of state-action pairs sampled by executing an arbitrary “behavior policy”; this allows us to address the exploration-exploitation problem by controlling how behavior policies are generated. By applying the method to the autonomous control of a three-dimensional cart-pole, we show that it can achieve optimal control efficiently in a partially observable domain.
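The paper's own estimator is not reproduced on this page, so the following Python fragment is only a minimal sketch of the general idea the abstract describes: importance-weighting samples drawn under a behavior policy to estimate both the policy gradient and the Fisher information matrix of a softmax target policy, then taking a natural gradient step in the sense of Kakade. The one-hot feature map, the function names, the per-sample weight pi(a|s)/b(a|s) (which corrects the action distribution only, ignoring the state-distribution mismatch), and the Fisher regularization are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 5, 3
N_FEAT = N_STATES * N_ACTIONS

def features(s, a):
    # One-hot feature vector for the (state, action) pair (an assumption
    # for this sketch; the paper uses its own parameterization).
    phi = np.zeros(N_FEAT)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def action_probs(theta, s):
    # Softmax (Gibbs) policy over actions in state s.
    logits = np.array([features(s, a) @ theta for a in range(N_ACTIONS)])
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def off_policy_natural_gradient(theta, trajectory, behavior_probs, reg=1e-3):
    """Estimate the natural policy gradient F^{-1} g from (state, action,
    return) samples drawn under a behavior policy. Each sample is
    importance-weighted by pi(a|s) / b(a|s), a common simplification that
    corrects the action distribution only."""
    g = np.zeros(N_FEAT)
    F = np.zeros((N_FEAT, N_FEAT))
    for (s, a, ret), b in zip(trajectory, behavior_probs):
        pi = action_probs(theta, s)
        rho = pi[a] / b                        # importance weight
        # Score function of a softmax policy:
        # grad log pi(a|s) = phi(s,a) - sum_a' pi(a'|s) phi(s,a')
        score = features(s, a) - sum(pi[ap] * features(s, ap)
                                     for ap in range(N_ACTIONS))
        g += rho * ret * score                 # REINFORCE-style gradient term
        F += rho * np.outer(score, score)      # Fisher information estimate
    g /= len(trajectory)
    F /= len(trajectory)
    # Regularized solve for F^{-1} g; reg is an assumption of this sketch.
    return np.linalg.solve(F + reg * np.eye(N_FEAT), g)

# Usage on synthetic data: a uniform behavior policy explores freely while
# the target policy parameters are improved off-policy from its samples.
theta = np.zeros(N_FEAT)
traj, b_probs = [], []
for _ in range(500):
    s = int(rng.integers(N_STATES))
    a = int(rng.integers(N_ACTIONS))           # uniform behavior policy
    ret = float(a == s % N_ACTIONS) + rng.normal(scale=0.1)  # toy return
    traj.append((s, a, ret))
    b_probs.append(1.0 / N_ACTIONS)
for _ in range(50):
    theta += 0.5 * off_policy_natural_gradient(theta, traj, b_probs)
```

The sketch illustrates why the decoupling matters for exploration-exploitation: the behavior policy can be made as exploratory as desired while the target policy still improves, and preconditioning by the Fisher matrix makes the update direction invariant to how the policy is parameterized.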
An erratum to this chapter can be found at http://dx.doi.org/10.1007/11550907_163.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Nakamura, Y., Mori, T., Ishii, S. (2005). An Off-Policy Natural Policy Gradient Method for a Partial Observable Markov Decision Process. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds) Artificial Neural Networks: Formal Models and Their Applications – ICANN 2005. ICANN 2005. Lecture Notes in Computer Science, vol 3697. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11550907_68
DOI: https://doi.org/10.1007/11550907_68
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28755-1
Online ISBN: 978-3-540-28756-8