
Average Reward Optimization with Multiple Discounting Reinforcement Learners

Conference paper in Neural Information Processing (ICONIP 2017), part of the book series Lecture Notes in Computer Science (LNTCS, volume 10634).

Abstract

Maximization of the average reward is a major goal in reinforcement learning. Existing model-free, value-based algorithms such as R-Learning use average-adjusted values. We propose a different framework, the Average Reward Independent Gamma Ensemble (AR-IGE). It is based on an ensemble of discounting Q-learning modules, each with a different discount factor. Existing algorithms learn only the optimal policy and its average reward. In contrast, the AR-IGE learns several different policies and their resulting average rewards. We prove the optimality of the AR-IGE in episodic and deterministic problems where rewards are given at several goal states. Furthermore, we show that the AR-IGE outperforms existing algorithms in such problems, especially when the task changes and policies must be adapted. The AR-IGE represents a new way to optimize the average reward that could lead to further improvements in the field.
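The following is a minimal sketch of the idea described in the abstract: several tabular Q-learning modules that learn from the same experience but use different discount factors, each keeping its own average-reward estimate, with actions taken greedily from the module whose estimate is currently highest. The class and method names, the epsilon-greedy exploration, the set of discount factors, and the per-episode averaging and selection rules are all illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

class DiscountedQModule:
    """One tabular Q-learning module with its own discount factor gamma."""
    def __init__(self, n_states, n_actions, gamma, alpha=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.gamma = gamma
        self.alpha = alpha
        self.avg_reward = 0.0   # running estimate of this module's average reward
        self.avg_count = 0

    def update(self, s, a, r, s_next, done):
        # Standard off-policy Q-learning update with this module's gamma.
        target = r if done else r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

    def greedy_action(self, s):
        return int(np.argmax(self.Q[s]))

    def update_avg_reward(self, episode_return, episode_length):
        # Average reward of an episodic task: total reward divided by episode length
        # (an assumed bookkeeping rule for this sketch).
        self.avg_count += 1
        r_bar = episode_return / episode_length
        self.avg_reward += (r_bar - self.avg_reward) / self.avg_count


class AverageRewardEnsemble:
    """Ensemble of Q-learning modules with different discount factors.

    Acts with the module whose estimated average reward is currently highest
    (a simplified selection rule assumed for illustration).
    """
    def __init__(self, n_states, n_actions, gammas=(0.5, 0.9, 0.99)):
        self.modules = [DiscountedQModule(n_states, n_actions, g) for g in gammas]

    def act(self, s, epsilon=0.1):
        if np.random.rand() < epsilon:
            return np.random.randint(self.modules[0].Q.shape[1])
        best = max(self.modules, key=lambda m: m.avg_reward)
        return best.greedy_action(s)

    def update_all(self, s, a, r, s_next, done):
        # Every module learns from the same transition, each with its own gamma.
        for m in self.modules:
            m.update(s, a, r, s_next, done)
```

Because each module performs off-policy Q-learning, all modules can learn from the same stream of experience, so adding more discount factors does not require additional environment interaction in this sketch.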



Acknowledgement

We thank Tadashi Kozuno for his help with parts of the optimality proof.

Author information

Correspondence to Chris Reinke.


Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Reinke, C., Uchibe, E., Doya, K. (2017). Average Reward Optimization with Multiple Discounting Reinforcement Learners. In: Liu, D., Xie, S., Li, Y., Zhao, D., El-Alfy, E.S. (eds) Neural Information Processing. ICONIP 2017. Lecture Notes in Computer Science, vol 10634. Springer, Cham. https://doi.org/10.1007/978-3-319-70087-8_81


  • DOI: https://doi.org/10.1007/978-3-319-70087-8_81


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-70086-1

  • Online ISBN: 978-3-319-70087-8

  • eBook Packages: Computer Science; Computer Science (R0)
