Abstract
This paper studies reinforcement learning in which players base their action choices on valuations they hold for the actions. We identify two general conditions on valuation updating rules that together guarantee that the probability of playing a subgame perfect Nash equilibrium (SPNE) converges to one in games where no player is indifferent between two outcomes unless every other player is also indifferent. The same conditions guarantee that the fraction of times a SPNE is played converges to one almost surely. We also show that for additively separable valuations, in which a valuation is the sum of an empirical term and an error term, the conditions guaranteeing convergence can be made more intuitive. In addition, we give four examples of valuations that satisfy our conditions. These examples represent different degrees of sophistication in learning behavior and include well-known examples of reinforcement learning.
Notes
For example, even in a relatively simple game like tic-tac-toe, where each player has at most four action choices, the game tree contains 255,168 play paths (terminal nodes). If rotational and reflectional symmetries are considered, the number is reduced to 26,830. Either way, forming a complete strategy for the game, or even solving the game through backward induction, appears to be beyond the ability of human players.
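The terminal-node count quoted above can be checked by brute-force enumeration. The following is a minimal sketch (the board representation and move ordering are our own, not the paper's):

```python
def winner(b):
    """Return the winning mark ('X' or 'O') on board b, or None."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    for i, j, k in lines:
        if b[i] is not None and b[i] == b[j] == b[k]:
            return b[i]
    return None

def count_paths(b=(None,) * 9, player='X'):
    """Count play paths: sequences of moves ending at a win or a full board."""
    if winner(b) is not None or all(c is not None for c in b):
        return 1
    return sum(count_paths(b[:i] + (player,) + b[i + 1:],
                           'O' if player == 'X' else 'X')
               for i in range(9) if b[i] is None)

print(count_paths())  # 255168 play paths, as stated in the note
```

The count treats two games as distinct whenever the move sequences differ, which is exactly the terminal-node count of the game tree.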
Whether experimentation is viewed as a conscious choice to explore or simply as a mistake, assuming that the experimentation probability stays the same no matter how much experience a player has gained seems unnatural in a model of learning. A more plausible model of learning should reflect the fact that the rate at which a player experiments, or makes mistakes, when playing the same game for the millionth time would be lower than when she has played it only a few times.
For a similar critique of finite state space Markov learning models, see Ellison (1993).
See, for example, Durrett (2010), Theorem 2.3.1 and Theorem 2.3.6 for Borel–Cantelli lemmas.
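For reference, the two lemmas take the following standard form (this restatement is ours, with \(\{A_n \text{ i.o.}\} = \bigcap_{m \ge 1} \bigcup_{n \ge m} A_n\)):

```latex
% First Borel--Cantelli lemma: no independence assumption is needed.
\[
\sum_{n=1}^{\infty} \Pr(A_n) < \infty
\;\Longrightarrow\;
\Pr(A_n \text{ i.o.}) = 0 .
\]
% Second Borel--Cantelli lemma: the events A_n must be independent.
\[
\{A_n\} \text{ independent, } \sum_{n=1}^{\infty} \Pr(A_n) = \infty
\;\Longrightarrow\;
\Pr(A_n \text{ i.o.}) = 1 .
\]
```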
That is, suppose we are on \(\bigcap _{t=T+1}^\infty B_t^c\), which occurs with probability \(\eta \).
For example, even if a decision node is encountered for the first time in the thousandth time the game has been played, the player will still experiment with probability \(\frac{\varepsilon }{\sqrt{1000}}\) at this node, despite the fact that she has learned nothing about the actions at this particular node.
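As a small illustration of the point (the rate \(\varepsilon /\sqrt{t}\) and the numbers are the note's own; the function name is ours), a time-indexed experimentation rate depends only on how many times the game has been played, not on the player's experience at the node in question:

```python
import math

# Time-based experimentation rate from the note: eps / sqrt(t), where t counts
# plays of the whole game, not visits to the particular decision node.
def experimentation_rate(eps, t):
    return eps / math.sqrt(t)

# By the thousandth play the rate is already small at *every* node, including
# a node that is being reached for the very first time.
print(experimentation_rate(1.0, 1000))  # 1/sqrt(1000), about 0.0316
```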
Models similar to this have been studied widely in normal-form games. For example, Sarin and Vahid (1999) provide convergence results for a reinforcement learning model in single-player decision problems. Borgers and Sarin (1997) connect reinforcement learning with replicator dynamics, and Hopkins (2002) explores the connection between reinforcement learning and stochastic fictitious play. Beggs (2005) and Laslier et al. (2001) give conditions under which a reinforcement learning rule converges to a Nash equilibrium in normal-form games.
See, for example, Osborne and Rubinstein (1994, pp. 100–101).
We believe this approach to be more natural in our setting, where players are assumed to treat each game as an end in itself. In such a setting, it is not clear why players would choose to experiment: since they are not concerned with future payoffs, there is no reason for them to sacrifice current payoff by taking an action they believe to be suboptimal.
This argument is based on Borel–Cantelli lemmas, but it glosses over the fact that M and T are random and that the events being considered here are not independent. The proofs given in the paper provide a formal argument.
We also show that the fraction of times the SPNE is played converges to one with probability one.
Random variable \(\tau _n^z\) is a stopping time. The following facts about stopping times are used throughout the paper. For any stopping time \(\tau \), \({\mathscr {F}}_{\tau } = \{B \in {\mathscr {F}} : \forall n \ B \cap \{\tau \le n\} \in {\mathscr {F}}_n\}\) is a \(\sigma \)-field consisting of events up to (random) time \(\tau \). If \(\tau _0 < \tau _1 < \tau _2 < \cdots \) almost surely, then \(\{{\mathscr {F}}_{\tau _n} : n \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), where \({\mathbb {Z}}_{\scriptscriptstyle +} = \{0,1,2,\ldots \}\), is a filtration. Moreover, if \(\{Y_t : t \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\) is adapted to \(\{{\mathscr {F}}_t : t \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), then \(Y_{\tau _n}\) is adapted to \(\{{\mathscr {F}}_{\tau _n} : n \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), and if \(Y_t \rightarrow Y\) almost surely as \(t \rightarrow \infty \), then \(Y_{\tau _n} \rightarrow Y\) almost surely as \(n \rightarrow \infty \).
We use the phrase “on B, C” (or, equivalently, “C on B”) to mean that property C holds for every \(\omega \in B\).
See, for example, Durrett (2010, Theorem 5.3.2).
This assumption may appear strong at first glance. However, it is stated in “if...then...” form, and it is the hypothesis part of the conditional that is restrictive; a restrictive hypothesis makes the assumption as a whole weak.
See, for example, Loève (1978, p.53).
Since there are no ties in the payoffs, the SPNE of \({\mathscr {G}}_z\) is unique.
The urn analogy is more natural if initial propensities and payoffs are assumed to be integers. Otherwise, one imagines an abstract urn process where balls are perfectly divisible.
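The urn analogy can be sketched as a simple simulation at a single decision node; the action names, payoffs, and initial propensities below are illustrative assumptions, not taken from the paper:

```python
import random

def play(propensities, payoffs, rounds, rng=random.Random(0)):
    """Polya-style urn sketch: draw an action with probability proportional to
    its propensity, then reinforce the chosen action by its payoff (i.e., add
    that many 'balls' for it to the urn)."""
    for _ in range(rounds):
        total = sum(propensities.values())
        r = rng.uniform(0, total)
        for action, weight in propensities.items():
            r -= weight
            if r <= 0:
                propensities[action] += payoffs[action]
                break
    return propensities

# Two actions with equal initial propensities but unequal payoffs.
urn = play({"L": 1.0, "R": 1.0}, {"L": 2.0, "R": 1.0}, 5000)
print(urn)  # propensities accumulated after 5000 reinforced plays
```

With divisible payoffs the “balls” need not be integers, which is the abstract urn process the note describes.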
See, for example, Pemantle (2007), Section 3 and Theorem 3.3 in particular.
References
Beggs A (2005) On the convergence of reinforcement learning. J Econ Theory 122:1–36
Borgers T, Sarin R (1997) Learning through reinforcement and replicator dynamics. J Econ Theory 77:1–14
Durrett R (2010) Probability: theory and examples, 4th edn. Cambridge University Press, Cambridge
Ellison G (1993) Learning, local interaction, and coordination. Econometrica 61:1047–1072
Hopkins E (2002) Two competing models of how people learn in games. Econometrica 70:2141–2166
Jehiel P, Samet D (2005) Learning to play games in extensive form by valuation. J Econ Theory 124:129–148
Laslier J-F, Walliser B (2005) A reinforcement learning process in extensive form games. Int J Game Theory 33:219–227
Laslier J-F, Topol R, Walliser B (2001) A behavioral learning process in games. Games Econ Behav 37:340–366
Loève M (1978) Probability theory, vol II, 4th edn. Springer, Berlin
Marx L, Swinkels J (1997) Order independence for iterated weak dominance. Games Econ Behav 18:219–245
Osborne M, Rubinstein A (1994) A course in game theory. MIT Press, Cambridge
Østerdal L (2005) Iterated weak dominance and subgame dominance. J Math Econ 41:637–645
Pemantle R (2007) A survey of random processes with reinforcement. Probab Surv 4:1–79
Sarin R, Vahid F (1999) Payoff assessments without probabilities: a simple dynamic model of choice. Games Econ Behav 28:294–309
Acknowledgments
An earlier version of this paper benefited greatly from helpful discussions with Chris Shannon. Pak gratefully acknowledges support from NSF grants SES-9710424 and SES-9818759. Xu gratefully acknowledges support from China National Natural Science Foundation grant 71403217.
Pak, M., Xu, B. Generalized reinforcement learning in perfect-information games. Int J Game Theory 45, 985–1011 (2016). https://doi.org/10.1007/s00182-015-0499-1