Generalized reinforcement learning in perfect-information games

Abstract

This paper studies reinforcement learning in which players base their action choices on valuations they have for the actions. We identify two general conditions on valuation updating rules that together guarantee that the probability of playing a subgame perfect Nash equilibrium (SPNE) converges to one in games where no player is indifferent between two outcomes without every other player also being indifferent. The same conditions guarantee that the fraction of times an SPNE is played converges to one almost surely. We also show that for additively separable valuations, in which valuations are the sum of empirical and error terms, the conditions guaranteeing convergence can be made more intuitive. In addition, we give four examples of valuations that satisfy our conditions. These examples represent different degrees of sophistication in learning behavior and include well-known examples of reinforcement learning.
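As an illustration of the kind of learning rule the abstract describes, the sketch below implements one additively separable valuation rule at a single decision node: the valuation of an action is its empirical average payoff plus an error term that shrinks with experience, and the player experiments with a probability that decays in the number of visits to the node. This is only a sketch consistent with the abstract's description, not the paper's model; the class name `ValuationLearner`, the uniform error term, and the \(\varepsilon /\sqrt{n}\) experimentation rate (with \(n\) the number of visits) are assumptions made for the example.

```python
import math
import random
from collections import defaultdict

class ValuationLearner:
    """Toy additively separable valuation rule (illustrative, not the paper's model):
    valuation(a) = empirical mean payoff of a + an error term that shrinks with use."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon                # baseline experimentation parameter (assumed)
        self.rng = random.Random(seed)
        self.sum_payoff = defaultdict(float)  # cumulative payoff received from each action
        self.count = defaultdict(int)         # number of times each action has been taken
        self.visits = 0                       # number of times this decision node was reached

    def valuation(self, a):
        n = self.count[a]
        empirical = self.sum_payoff[a] / n if n > 0 else 0.0
        error = self.rng.uniform(-1.0, 1.0) / math.sqrt(n + 1)  # error term vanishes with experience
        return empirical + error

    def choose(self):
        self.visits += 1
        # experimentation probability decays with experience at this node
        if self.rng.random() < self.epsilon / math.sqrt(self.visits):
            return self.rng.choice(self.actions)
        return max(self.actions, key=self.valuation)  # otherwise play the highest-valued action

    def update(self, action, payoff):
        self.sum_payoff[action] += payoff
        self.count[action] += 1

# Usage: two actions, "R" pays more on average, so it should eventually dominate play.
learner = ValuationLearner(["L", "R"])
for _ in range(1000):
    a = learner.choose()
    learner.update(a, 1.0 if a == "R" else 0.5)
print(max(learner.actions, key=lambda x: learner.count[x]))  # typically prints "R"
```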


Notes

  1. For example, even in a relatively simple game like tic-tac-toe, where each player moves at most five times, the game tree contains 255,168 play paths (terminal nodes); a brute-force count reproducing this figure is sketched after these notes. If rotational and reflectional symmetries are considered, the number is reduced to 26,830. Either way, forming a complete strategy for the game, or even solving the game through backward induction, appears to be beyond the ability of human players.

  2. Whether experimentation is viewed as a conscious choice to explore or simply as a mistake, assuming that the experimentation probability stays the same no matter how much experience a player has gained seems unnatural in a model of learning. A more plausible model of learning should reflect the fact that the rate at which a player experiments, or makes mistakes, when playing the same game for the millionth time would be lower than when she has played it only a few times.

  3. For a similar critique of finite state space Markov learning models, see Ellison (1993).

  4. See, for example, Durrett (2010), Theorem 2.3.1 and Theorem 2.3.6 for Borel–Cantelli lemmas.

  5. That is, suppose we are on \(\bigcap_{t=T+1}^{\infty} B_t^c\), which occurs with probability \(\eta\).

  6. For example, even if a decision node is encountered for the first time on the thousandth play of the game, the player will still experiment with probability \(\frac{\varepsilon }{\sqrt{1000}}\) at this node, despite the fact that she has learned nothing about the actions at this particular node.

  7. Models similar to this one have been studied widely in normal-form games. For example, Sarin and Vahid (1999) provide convergence results for a reinforcement learning model in single-player decision problems. Borgers and Sarin (1997) connect reinforcement learning with replicator dynamics, and Hopkins (2002) explores the connection between reinforcement learning and stochastic fictitious play. Beggs (2005) and Laslier et al. (2001) give conditions under which reinforcement learning rules converge to a Nash equilibrium in normal-form games.

  8. A similar condition, called “transfer of decision maker indifference,” has been used as a sufficient condition for the order independence of iterated removal of weakly dominated strategies in strategic-form games (Marx and Swinkels 1997; Østerdal 2005).

  9. See, for example, Osborne and Rubinstein (1994, pp. 100–101).

  10. We believe this approach to be more natural in our setting, where players are assumed to treat each game as an end in itself. In such a setting, it is not clear why players would choose to experiment. Since they are not concerned with future payoffs, there is no reason why they would be willing to sacrifice current payoff and take an action that they believe to be suboptimal.

  11. This argument is based on Borel–Cantelli lemmas, but it glosses over the fact that M and T are random and that the events being considered here are not independent. The proofs given in the paper provide a formal argument.

  12. We also show that the fraction of times the SPNE is played converges to one with probability one.

  13. The random variable \(\tau _n^z\) is a stopping time. The following facts about stopping times are used throughout the paper. For any stopping time \(\tau \), \({\mathscr {F}}_{\tau } = \{B \in {\mathscr {F}} : \forall n \ B \cap \{\tau \le n\} \in {\mathscr {F}}_n\}\) is a \(\sigma \)-field consisting of the events up to the (random) time \(\tau \). If \(\tau _0 < \tau _1 < \tau _2 < \cdots \) almost surely, then \(\{{\mathscr {F}}_{\tau _n} : n \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), where \({\mathbb {Z}}_{\scriptscriptstyle +} = \{0,1,2,\ldots \}\), is a filtration. Moreover, if \(\{Y_t : t \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\) is adapted to \(\{{\mathscr {F}}_t : t \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), then \(Y_{\tau _n}\) is adapted to \(\{{\mathscr {F}}_{\tau _n} : n \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), and if \(Y_t \rightarrow Y\) almost surely as \(t \rightarrow \infty \), then \(Y_{\tau _n} \rightarrow Y\) almost surely as \(n \rightarrow \infty \).

  14. We use the phrase “on B, C” (or equivalently, “C on B”) to mean that for every \(\omega \in B\), property C holds.

  15. See, for example, Durrett (2010, Theorem 5.3.2).

  16. This assumption may appear strong at first glance. However, the assumption is stated in “if...then...” form, and it is the hypothesis part of the condition that is strong; a strong hypothesis limits the cases in which the condition has any bite, which makes the assumption as a whole weak.

  17. See, for example, Loève (1978, p. 53).

  18. Since there are no ties in the payoffs, the SPNE of \({\mathscr {G}}_z\) is unique.

  19. The urn analogy is more natural if initial propensities and payoffs are assumed to be integers. Otherwise, one imagines an abstract urn process in which the balls are perfectly divisible; a toy version of such an urn process is sketched after these notes.

  20. See, for example, Pemantle (2007), Section 3 and Theorem 3.3 in particular.
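The brute-force count referred to in Note 1 can be sketched as follows: enumerate every legal move sequence of tic-tac-toe and count a path as terminal when a player completes a line or the board fills up. This reproduces the 255,168 figure; the symmetry-reduced count of 26,830 would require additional bookkeeping that is not shown here. The function names are illustrative.

```python
# Brute-force count of tic-tac-toe play paths (terminal nodes), as quoted in Note 1.
# A path ends when one player completes a row, column, or diagonal, or the board is full.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return the winning player (0 or 1) if a line is completed, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def count_paths(board=(None,) * 9, player=0):
    """Count move sequences from this position until a win or a full board."""
    if winner(board) is not None or all(s is not None for s in board):
        return 1                                   # a terminal node has been reached
    total = 0
    for i in range(9):
        if board[i] is None:                       # try every remaining square
            total += count_paths(board[:i] + (player,) + board[i + 1:], 1 - player)
    return total

print(count_paths())  # prints 255168
```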
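The urn analogy in Note 19 can likewise be illustrated with a toy process: each action's propensity is the quantity of (possibly divisible) "balls" of its color, an action is drawn with probability proportional to its propensity, and the realized payoff is added back as balls of the same color, in the spirit of cumulative-proportional reinforcement. This is an illustration of the analogy under assumed payoffs, not the paper's formal process.

```python
import random

def urn_reinforcement(propensities, payoff_fn, rounds=10000, seed=0):
    """Toy urn process: draw an action with probability proportional to its propensity
    (its 'balls', possibly divisible), then add the realized payoff back as balls of
    the same color. Illustrative only; payoffs are assumed to be strictly positive."""
    rng = random.Random(seed)
    prop = dict(propensities)                     # e.g., {"L": 1.0, "R": 1.0}
    for _ in range(rounds):
        r = rng.uniform(0.0, sum(prop.values()))  # draw a 'ball' uniformly from the urn
        for action, weight in prop.items():
            r -= weight
            if r <= 0.0:
                break
        prop[action] += payoff_fn(action)         # reinforce the drawn action
    return prop

# Usage: "R" pays 2 and "L" pays 1, so the "R" share of the urn tends to grow over time.
print(urn_reinforcement({"L": 1.0, "R": 1.0}, lambda a: 2.0 if a == "R" else 1.0))
```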

References

  • Beggs A (2005) On the convergence of reinforcement learning. J Econ Theory 122:1–36

  • Borgers T, Sarin R (1997) Learning through reinforcement and replicator dynamics. J Econ Theory 77:1–14

  • Durrett R (2010) Probability: theory and examples, 4th edn. Duxbury Press, New York

  • Ellison G (1993) Learning, local interaction, and coordination. Econometrica 61:1047–1072

  • Hopkins E (2002) Two competing models of how people learn in games. Econometrica 70:2141–2166

  • Jehiel P, Samet D (2005) Learning to play games in extensive form by valuation. J Econ Theory 124:129–148

  • Laslier J-F, Walliser B (2005) A reinforcement learning process in extensive form games. Int J Game Theory 33:219–227

  • Laslier J-F, Topol R, Walliser B (2001) A behavioral learning process in games. Games Econ Behav 37:340–366

  • Loève M (1978) Probability theory, vol II, 4th edn. Springer, Berlin

  • Marx L, Swinkels J (1997) Order independence for iterated weak dominance. Games Econ Behav 18:219–245

  • Osborne M, Rubinstein A (1994) A course in game theory. MIT Press, Cambridge

  • Østerdal L (2005) Iterated weak dominance and subgame dominance. J Math Econ 41:637–645

  • Pemantle R (2007) A survey of random processes with reinforcement. Probab Surv 4:1–79

  • Sarin R, Vahid F (1999) Payoff assessments without probabilities: a simple dynamic model of choice. Games Econ Behav 28:294–309

Acknowledgments

An earlier version of this paper benefited greatly from helpful discussions with Chris Shannon. Pak gratefully acknowledges support from NSF grants SES-9710424 and SES-9818759. Xu gratefully acknowledges support from China National Natural Science Foundation grant 71403217.

Author information

Correspondence to Maxwell Pak.


Cite this article

Pak, M., Xu, B. Generalized reinforcement learning in perfect-information games. Int J Game Theory 45, 985–1011 (2016). https://doi.org/10.1007/s00182-015-0499-1
