Optimizing Average Reward Using Discounted Rewards

  • Conference paper
Computational Learning Theory (COLT 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2111)

Abstract

In many reinforcement learning problems, it is appropriate to optimize the average reward. In practice, this is often done by solving the Bellman equations using a discount factor close to 1. In this paper, we provide a bound on the average reward of the policy obtained by solving the Bellman equations, which depends on the relationship between the discount factor and the mixing time of the Markov chain. We extend this result to the direct policy gradient of Baxter and Bartlett, in which a discount parameter is used to find a biased estimate of the gradient of the average reward with respect to the parameters of a policy. We show that this biased gradient is an exact gradient of a related discounted problem and provide a bound on the optima found by following these biased gradients of the average reward. Further, we show that the exact Hessian in this related discounted problem is an approximate Hessian of the average reward, with equality in the limit as the discount factor tends to 1. We then provide an algorithm to estimate the Hessian from a sample path of the underlying Markov chain, which converges with probability 1.
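
As a rough illustration of the estimator the paper analyses, the sketch below follows the GPOMDP-style algorithm of Baxter and Bartlett [2, 3]: a discount parameter beta shapes an eligibility trace of score-function terms, and the running average of reward-weighted traces gives a biased estimate of the gradient of the average reward. The env and policy interfaces here are hypothetical placeholders introduced only for this sketch, not part of the paper.

    import numpy as np

    def biased_gradient_estimate(env, policy, theta, beta, T, seed=0):
        # Sketch of a GPOMDP-style, discount-biased estimate of the gradient of
        # the average reward (after Baxter and Bartlett [2, 3]).  Hypothetical
        # interfaces: env.reset() -> state; env.step(a) -> (state, reward);
        # policy.sample(s, theta, rng) -> action;
        # policy.grad_log(s, a, theta) -> gradient of log pi(a | s; theta).
        rng = np.random.default_rng(seed)
        s = env.reset()
        z = np.zeros_like(theta)      # eligibility trace, discounted by beta
        grad = np.zeros_like(theta)   # running gradient estimate
        for t in range(T):
            a = policy.sample(s, theta, rng)
            z = beta * z + policy.grad_log(s, a, theta)
            s, r = env.step(a)
            grad += (r * z - grad) / (t + 1)   # running average of r * z
        return grad   # bias shrinks as beta -> 1; variance grows with the trace

The paper's bounds relate the quality of the optima reached by following such biased gradients to how close the discount parameter is to 1 relative to the mixing time of the underlying Markov chain.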

References

  1. P. Bartlett and J. Baxter. Estimation and approximation bounds for gradient based reinforcement learning. Technical report, Australian National University, 2000.

  2. J. Baxter and P. Bartlett. Direct gradient-based reinforcement learning. Technical report, Australian National University, Research School of Information Sciences and Engineering, July 1999.

  3. J. Baxter and P. Bartlett. Algorithms for infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 2001. (forthcoming).

  4. D.P. Bertsekas. Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 1995.

  5. P. Marbach and J. Tsitsiklis. Simulation-based optimization of Markov reward processes. Technical report, Massachusetts Institute of Technology, 1998.

  6. S. Singh, T. Jaakkola, and M.I. Jordan. Learning without state-estimation in partially observable Markovian decision processes. Proc. 11th International Conference on Machine Learning, 1994.

  7. R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 13, 2000.

  8. R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

  9. J.N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35:319–349, 1999.

  10. J.N. Tsitsiklis and B. Van Roy. On average versus discounted reward temporal-difference learning. Machine Learning, 2001. (forthcoming).

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kakade, S. (2001). Optimizing Average Reward Using Discounted Rewards. In: Helmbold, D., Williamson, B. (eds) Computational Learning Theory. COLT 2001. Lecture Notes in Computer Science, vol 2111. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44581-1_40

  • DOI: https://doi.org/10.1007/3-540-44581-1_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42343-0

  • Online ISBN: 978-3-540-44581-4

  • eBook Packages: Springer Book Archive
