Abstract
We consider the problem of maximizing the average reward in a controlled Markov environment that also contains some arbitrarily varying elements. This problem is captured by a two-person stochastic game model involving the reward-maximizing agent and a second player, who is free to use an arbitrary (non-stationary and unpredictable) control strategy. While the minimax value of the associated zero-sum game provides a guaranteed performance level, the fact that the second player's behavior is observed as the game unfolds opens up the opportunity to improve upon this minimax value whenever the second player deviates from a worst-case strategy. This basic idea has been formalized in the context of repeated matrix games by the classical notion of regret minimization with respect to the Bayes envelope, where an attainable performance goal is defined in terms of the empirical frequencies of the opponent's actions. This paper extends these ideas to problems with Markovian dynamics, under appropriate recurrence conditions. The Bayes envelope is first defined in a natural way in terms of the observed state-action frequencies. As this envelope may not be attainable in general, we define a proper convexification thereof as an attainable solution concept. In the specific case of single-controller games, where the opponent alone controls the state transitions, the Bayes envelope itself turns out to be convex and attainable. Some concrete examples are shown to fit this framework.
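As a concrete illustration of the repeated matrix-game notion the abstract builds on (this sketch is not from the paper; the reward matrix and both action sequences are hypothetical), the Bayes envelope evaluates the best fixed action in hindsight against the empirical frequencies of the opponent's actions, and regret is the gap between that envelope and the reward actually collected:

```python
import numpy as np

# Hypothetical reward matrix for the maximizing (row) player:
# rows index our actions, columns index the opponent's actions.
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

rng = np.random.default_rng(0)
T = 10_000
opponent = rng.integers(0, 2, size=T)  # arbitrary (here random) opponent sequence
ours = rng.integers(0, 2, size=T)      # a placeholder, non-adaptive strategy

# Average reward actually obtained over the T rounds.
avg_reward = R[ours, opponent].mean()

# Empirical frequency of the opponent's actions, and the Bayes envelope:
# the reward of the best single action against those frequencies.
q = np.bincount(opponent, minlength=2) / T
bayes_envelope = max(R[a] @ q for a in range(2))

# Regret relative to the Bayes envelope; a regret-minimizing strategy
# drives this quantity to zero (or below) as T grows.
regret = bayes_envelope - avg_reward
```

Since the placeholder strategy here ignores the opponent entirely, its regret merely hovers near zero for this symmetric reward matrix; the adaptive strategies studied in the paper guarantee vanishing regret against every opponent sequence.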
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Mannor, S., Shimkin, N. (2001). Adaptive Strategies and Regret Minimization in Arbitrarily Varying Markov Environments. In: Helmbold, D., Williamson, B. (eds) Computational Learning Theory. COLT 2001. Lecture Notes in Computer Science, vol 2111. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44581-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42343-0
Online ISBN: 978-3-540-44581-4