Natural actor-critic with baseline adjustment for variance reduction


Abstract

In this study, we discuss the baseline function used in estimating the natural policy gradient with respect to the variance of the estimate, and demonstrate a condition under which the optimal variance-reducing baseline function is equivalent to the state value function. Outside this condition, however, the state value can differ considerably from the optimal baseline. For such cases, we propose an extended version of the NTD algorithm in which an auxiliary function is estimated to adjust the baseline, which is the state value estimate in the original NTD algorithm, toward the optimal baseline. The proposed algorithm is applied to simple MDPs and to a challenging pendulum swing-up problem.
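
As background for the abstract's central claim, the following standard policy-gradient identities are a sketch of well-known results, not a reproduction of the paper's own derivation; the symbols pi_theta, Q^pi, and b are generic notation rather than the paper's. They show why subtracting a state-dependent baseline leaves the gradient estimate unbiased, and why the variance-minimizing baseline is in general a score-weighted average of action values rather than the state value itself:

% Any state-dependent baseline b(s) leaves the policy-gradient
% estimator unbiased, because the expected score is zero:
\nabla_\theta J(\theta)
  = \mathbb{E}_{s,a}\bigl[ \nabla_\theta \log \pi_\theta(a \mid s)\,
      \bigl( Q^{\pi}(s,a) - b(s) \bigr) \bigr],
\qquad
\mathbb{E}_{a \sim \pi_\theta}\bigl[ \nabla_\theta \log \pi_\theta(a \mid s) \bigr] = 0 .

% For the ordinary (vanilla) gradient, the per-state baseline that
% minimizes the variance of a single-sample estimate weights each
% action value by the squared magnitude of the score:
b^{*}(s)
  = \frac{ \mathbb{E}_{a \sim \pi_\theta}\bigl[ \| \nabla_\theta \log \pi_\theta(a \mid s) \|^{2}\, Q^{\pi}(s,a) \bigr] }
         { \mathbb{E}_{a \sim \pi_\theta}\bigl[ \| \nabla_\theta \log \pi_\theta(a \mid s) \|^{2} \bigr] } .

When the squared-score weight is (approximately) constant across actions, b*(s) reduces to the state value V^pi(s); this is the flavor of condition the abstract refers to. For the natural gradient the score is preconditioned by the inverse Fisher information matrix, so the analogous optimal baseline need not coincide with V^pi(s), which is what motivates estimating an auxiliary adjustment on top of the state value estimate in the extended NTD algorithm.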

Author information

Correspondence to Tetsuro Morimura.

Additional information

This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008.

Cite this article

Morimura, T., Uchibe, E. & Doya, K. Natural actor-critic with baseline adjustment for variance reduction. Artif Life Robotics 13, 275–279 (2008). https://doi.org/10.1007/s10015-008-0514-8
