Average-Reward Reinforcement Learning

Tadepalli, Prasad

doi:10.1007/978-0-387-30164-8_49

Synonyms

ARL; Average-cost neuro-dynamic programming; Average-cost optimization; Average-payoff reinforcement learning

Definition

Average-reward reinforcement learning (ARL) refers to learning policies that optimize the average reward per time step by continually taking actions and observing the outcomes, including the next state and the immediate reward.
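
As a rough illustration of the objective in this definition, the sketch below estimates the average reward per time step of a fixed policy by running it and dividing the accumulated reward by the number of steps. It is not taken from the entry: the two-state task `toy_step`, the action names "stay" and "move", and the reward values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import Callable, Hashable, Tuple

State = Hashable
Action = Hashable


def average_reward(
    step: Callable[[State, Action], Tuple[State, float]],
    policy: Callable[[State], Action],
    start_state: State,
    num_steps: int,
) -> float:
    """Empirical average reward per time step of `policy` over `num_steps` steps."""
    state = start_state
    total = 0.0
    for _ in range(num_steps):
        action = policy(state)
        # Observe the outcome: the next state and the immediate reward.
        state, reward = step(state, action)
        total += reward
    return total / num_steps


def toy_step(state: int, action: str) -> Tuple[int, float]:
    """Hypothetical deterministic two-state task (an assumption for illustration).

    "stay" keeps the state and pays 1.0 in state 0 or 2.0 in state 1;
    "move" switches to the other state and pays nothing.
    """
    if action == "stay":
        return state, (1.0 if state == 0 else 2.0)
    return 1 - state, 0.0


# Policy A never leaves state 0; policy B pays one unrewarded step to reach state 1.
always_stay = lambda s: "stay"
move_then_stay = lambda s: "move" if s == 0 else "stay"

print(average_reward(toy_step, always_stay, 0, 1000))     # 1.0
print(average_reward(toy_step, move_then_stay, 0, 1000))  # 1.998
```

Under these assumptions the second policy has the higher average reward, and the cost of its single unrewarded step shrinks toward zero as the number of steps grows, which is why the measure suits tasks that continue indefinitely.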

Motivation and Background

Reinforcement learning (RL) is the study of programs that improve their performance at some task by receiving rewards and punishments from the environment (Sutton & Barto, 1998). RL has been quite successful at automatically learning good procedures for complex tasks such as playing backgammon and scheduling elevators (Tesauro, 1992; Crites & Barto, 1998). In episodic domains, which have a natural termination condition such as the end of the game in backgammon, the obvious performance measure to optimize is the expected total reward per episode. But some domains such as elevator scheduling are recurrent,...

Recommended Reading

Abounadi, J., Bertsekas, D. P., & Borkar, V. (2002). Stochastic approximation for non-expansive maps: Application to Q-learning algorithms. SIAM Journal on Control and Optimization, 41(1), 1–22.
Barto, A. G., Bradtke, S. J., & Singh, S. P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1), 81–138.
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Belmont, MA: Athena Scientific.
Brafman, R. I., & Tennenholtz, M. (2002). R-MAX – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.
Crites, R. H., & Barto, A. G. (1998). Elevator group control using multiple reinforcement agents. Machine Learning, 33(2/3), 235–262.
Ghavamzadeh, M., & Mahadevan, S. (2006). Hierarchical average reward reinforcement learning. Journal of Machine Learning Research, 13(2), 197–229.
Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2/3), 209–232.
Mahadevan, S. (1996). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning, 22(1/2/3), 159–195.
Marbach, P., Mihatsch, O., & Tsitsiklis, J. N. (2000). Call admission control and routing in integrated service networks using neuro-dynamic programming. IEEE Journal on Selected Areas in Communications, 18(2), 197–208.
Proper, S., & Tadepalli, P. (2006). Scaling model-based average-reward reinforcement learning for product delivery. In European conference on machine learning (pp. 725–742). Springer.
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the tenth international conference on machine learning (pp. 298–305). San Mateo, CA: Morgan Kaufmann.
Seri, S., & Tadepalli, P. (2002). Model-based hierarchical average-reward reinforcement learning. In Proceedings of international machine learning conference (pp. 562–569). Sydney, Australia: Morgan Kaufmann.
Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tadepalli, P., & Ok, D. (1998). Model-based average-reward reinforcement learning. Artificial Intelligence, 100, 177–224.
Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8(3–4), 257–277.
Tsitsiklis, J., & Van Roy, B. (1999). Average cost temporal-difference learning. Automatica, 35(11), 1799–1808.
Van Roy, B., & Tsitsiklis, J. (2002). On average versus discounted temporal-difference learning. Machine Learning, 49(2/3), 179–191.
Wang, G., & Mahadevan, S. (1999). Hierarchical optimization of policy-coupled semi-Markov decision processes. In Proceedings of the 16th international conference on machine learning (pp. 464–473). Bled, Slovenia.

Editor information

Editors and Affiliations

Claude Sammut, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia, 2052
Geoffrey I. Webb, Faculty of Information Technology, Clayton School of Information Technology, Monash University, P.O. Box 63, Victoria, Australia, 3800

Cite this entry

Tadepalli, P. (2011). Average-Reward Reinforcement Learning. In: Sammut, C., Webb, G.I. (eds) Encyclopedia of Machine Learning. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-30164-8_49

DOI: https://doi.org/10.1007/978-0-387-30164-8_49
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-30768-8
Online ISBN: 978-0-387-30164-8
eBook Packages: Computer Science, Reference Module Computer Science and Engineering