Abstract
In this study, we discuss the baseline function used in estimating a natural policy gradient with respect to the variance of the gradient estimate, and we show a condition under which the variance-minimizing baseline function is equivalent to the state value function. Outside of this condition, however, the state value can differ considerably from the optimal baseline. For such cases, we propose an extended version of the NTD algorithm, in which an auxiliary function is estimated in order to adjust the baseline, which is the state value estimate in the original NTD algorithm, toward the optimal baseline. The proposed algorithm is applied to simple MDPs and to a challenging pendulum swing-up problem.
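As background for the baseline discussion above, the following is a standard result for the ordinary (non-natural) likelihood-ratio policy gradient, given here only as an illustrative sketch; the symbols \pi_\theta, Q^\pi, and V^\pi denote the usual parameterized policy, action value, and state value, and the weighting used in the paper's own natural-gradient derivation may differ. For any state-dependent baseline b(s), the estimator

\nabla_\theta J(\theta) = \mathbb{E}_{s,a}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\,\bigl(Q^\pi(s,a) - b(s)\bigr) \right]

remains unbiased, because \mathbb{E}_a\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\right] = 0. Minimizing the second moment of the per-sample estimate state by state yields the variance-optimal baseline

b^*(s) = \frac{\mathbb{E}_a\!\left[ \|\nabla_\theta \log \pi_\theta(a \mid s)\|^2\, Q^\pi(s,a) \right]}{\mathbb{E}_a\!\left[ \|\nabla_\theta \log \pi_\theta(a \mid s)\|^2 \right]},

which reduces to V^\pi(s) = \mathbb{E}_a\!\left[Q^\pi(s,a)\right] only when the weight \|\nabla_\theta \log \pi_\theta(a \mid s)\|^2 is uncorrelated with Q^\pi(s,a); otherwise the optimal baseline can differ considerably from the state value, which is the situation the baseline-adjustment scheme above targets.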
Additional information
This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008.
About this article
Cite this article
Morimura, T., Uchibe, E. & Doya, K. Natural actor-critic with baseline adjustment for variance reduction. Artif Life Robotics 13, 275–279 (2008). https://doi.org/10.1007/s10015-008-0514-8