Abstract
We consider a control problem in which the decision maker interacts with a standard Markov decision process, except that the reward functions vary arbitrarily over time. We extend the notion of Hannan consistency to this setting, showing that the agent can perform, in hindsight, almost as well as the best deterministic policy. We present efficient online algorithms in the spirit of reinforcement learning that guarantee that the agent's performance loss, or regret, vanishes over time, provided that the environment is oblivious to the agent's actions. Counterexamples show, however, that the regret need not vanish when the environment is not oblivious.
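To make the regret criterion concrete: writing r_t for the reward function revealed at time t, the average regret after T steps is (1/T)[max over stationary deterministic policies π of the T-step reward of π, minus the agent's T-step reward], and Hannan consistency asks that this quantity vanish as T grows. The sketch below is a minimal toy illustration of one standard ingredient in this line of work, a full-information exponential-weights (Hedge) learner run independently at each state; it is not the authors' algorithm, and all names (PerStateHedge, n_states, eta, ...) are invented for the example.

```python
import numpy as np

# Minimal sketch (assumption, not the paper's method): one Hedge learner
# per state, for the full-information, oblivious-rewards setting. Each
# state keeps exponential weights over actions and updates them from the
# arbitrary reward vector r_t(state, .) revealed after each step.

class PerStateHedge:
    def __init__(self, n_states, n_actions, eta=0.1, seed=0):
        self.log_w = np.zeros((n_states, n_actions))  # log-weights per (state, action)
        self.eta = eta                                # learning rate
        self.rng = np.random.default_rng(seed)

    def act(self, state):
        # Sample an action from the exponential-weights distribution at `state`.
        w = np.exp(self.log_w[state] - self.log_w[state].max())
        p = w / w.sum()
        return int(self.rng.choice(len(p), p=p))

    def update(self, state, rewards):
        # Multiplicative-weights update with the revealed reward vector.
        self.log_w[state] += self.eta * np.asarray(rewards, dtype=float)

# Hypothetical usage against a stream of arbitrary reward vectors:
# learner = PerStateHedge(n_states=5, n_actions=3)
# a = learner.act(state=2)
# learner.update(state=2, rewards=[0.1, 0.9, 0.0])
```

Per-expert Hedge of this kind controls regret at each state separately; the harder part, which the paper addresses, is relating such local guarantees to the global T-step regret against the best policy in the MDP.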
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Yu, J.Y., Mannor, S., Shimkin, N. (2008). Markov Decision Processes with Arbitrary Reward Processes. In: Girgin, S., Loth, M., Munos, R., Preux, P., Ryabko, D. (eds.) Recent Advances in Reinforcement Learning. EWRL 2008. Lecture Notes in Computer Science, vol. 5323. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89722-4_21
DOI: https://doi.org/10.1007/978-3-540-89722-4_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89721-7
Online ISBN: 978-3-540-89722-4