Abstract
This paper studies reinforcement learning in which players base their action choices on valuations they hold for the actions. We identify two general conditions on valuation updating rules that together guarantee that the probability of playing a subgame perfect Nash equilibrium (SPNE) converges to one in games where no player is indifferent between two outcomes unless every other player is also indifferent. The same conditions guarantee that the fraction of times a SPNE is played converges to one almost surely. We also show that for additively separable valuations, in which a valuation is the sum of an empirical term and an error term, the conditions guaranteeing convergence can be made more intuitive. In addition, we give four examples of valuations that satisfy our conditions. These examples represent different degrees of sophistication in learning behavior and include well-known examples of reinforcement learning.
Notes
For example, even in a relatively simple game like tic-tac-toe, where each player has at most four action choices, the game tree contains 255,168 play paths (terminal nodes). If rotational and reflectional symmetries are considered, the number is reduced to 26,830. Either way, forming a complete strategy for the game, or even solving the game through backward induction, appears to be beyond the ability of human players.
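The terminal-node count quoted above can be checked by brute-force enumeration. The following is a minimal sketch (the board representation and move ordering are our own, not the paper's):

```python
def winner(b):
    """Return the winning mark ('X' or 'O') on board b, or None."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    for i, j, k in lines:
        if b[i] is not None and b[i] == b[j] == b[k]:
            return b[i]
    return None

def count_paths(b=(None,) * 9, player='X'):
    """Count play paths: sequences of moves ending at a win or a full board."""
    if winner(b) is not None or all(c is not None for c in b):
        return 1
    return sum(count_paths(b[:i] + (player,) + b[i + 1:],
                           'O' if player == 'X' else 'X')
               for i in range(9) if b[i] is None)

print(count_paths())  # 255168 play paths, as stated in the note
```

The count treats two games as distinct whenever the move sequences differ, which is exactly the terminal-node count of the game tree.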
Whether experimentation is viewed as a conscious choice to explore or simply as a mistake, assuming that the experimentation probability stays the same no matter how much experience a player has gained seems unnatural in a model of learning. A more plausible model of learning should reflect the fact that the rate at which a player experiments, or makes mistakes, when playing the same game for the millionth time would be lower than when she has played it only a few times.
For a similar critique of finite state space Markov learning models, see Ellison (1993).
See, for example, Durrett (2010), Theorem 2.3.1 and Theorem 2.3.6 for Borel–Cantelli lemmas.
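For reference, the two lemmas take the following standard form (this restatement is ours, with \(\{A_n \text{ i.o.}\} = \bigcap_{m \ge 1} \bigcup_{n \ge m} A_n\)):

```latex
% First Borel--Cantelli lemma: no independence assumption is needed.
\[
\sum_{n=1}^{\infty} \Pr(A_n) < \infty
\;\Longrightarrow\;
\Pr(A_n \text{ i.o.}) = 0 .
\]
% Second Borel--Cantelli lemma: the events A_n must be independent.
\[
\{A_n\} \text{ independent, } \sum_{n=1}^{\infty} \Pr(A_n) = \infty
\;\Longrightarrow\;
\Pr(A_n \text{ i.o.}) = 1 .
\]
```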
That is, suppose we are on \(\bigcap _{t=T+1}^\infty B_t^c\), which occurs with probability \(\eta \).
For example, even if a decision node is encountered for the first time in the thousandth time the game has been played, the player will still experiment with probability \(\frac{\varepsilon }{\sqrt{1000}}\) at this node, despite the fact that she has learned nothing about the actions at this particular node.
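As a small illustration of the point (the rate \(\varepsilon /\sqrt{t}\) and the numbers are the note's own; the function name is ours), a time-indexed experimentation rate depends only on how many times the game has been played, not on the player's experience at the node in question:

```python
import math

# Time-based experimentation rate from the note: eps / sqrt(t), where t counts
# plays of the whole game, not visits to the particular decision node.
def experimentation_rate(eps, t):
    return eps / math.sqrt(t)

# By the thousandth play the rate is already small at *every* node, including
# a node that is being reached for the very first time.
print(experimentation_rate(1.0, 1000))  # 1/sqrt(1000), about 0.0316
```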
Models similar to this have been studied widely in normal-form games. For example, Sarin and Vahid (1999) provide convergence results for a reinforcement learning model in single-player decision problems. Borgers and Sarin (1997) connect reinforcement learning with replicator dynamics, and Hopkins (2002) explores the connection between reinforcement learning and stochastic fictitious play. Beggs (2005) and Laslier et al. (2001) give conditions under which a reinforcement learning rule converges to a Nash equilibrium in normal-form games.
See, for example, Osborne and Rubinstein (1994, pp. 100–101).
We believe this approach to be more natural in our setting, where players are assumed to treat each game as an end in itself. In such a setting, it is not clear why players would choose to experiment: since they are not concerned with future payoffs, there is no reason for them to sacrifice current payoff by taking an action they believe to be suboptimal.
This argument is based on Borel–Cantelli lemmas, but it glosses over the fact that M and T are random and that the events being considered here are not independent. The proofs given in the paper provide a formal argument.
We also show that the fraction of times the SPNE is played converges to one with probability one.
Random variable \(\tau _n^z\) is a stopping time. The following facts about stopping times are used throughout the paper. For any stopping time \(\tau \), \({\mathscr {F}}_{\tau } = \{B \in {\mathscr {F}} : \forall n \ B \cap \{\tau \le n\} \in {\mathscr {F}}_n\}\) is a \(\sigma \)-field consisting of events up to (random) time \(\tau \). If \(\tau _0 < \tau _1 < \tau _2 < \cdots \) almost surely, then \(\{{\mathscr {F}}_{\tau _n} : n \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), where \({\mathbb {Z}}_{\scriptscriptstyle +} = \{0,1,2,\ldots \}\), is a filtration. Moreover, if \(\{Y_t : t \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\) is adapted to \(\{{\mathscr {F}}_t : t \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), then \(Y_{\tau _n}\) is adapted to \(\{{\mathscr {F}}_{\tau _n} : n \in {\mathbb {Z}}_{\scriptscriptstyle +}\}\), and if \(Y_t \rightarrow Y\) almost surely as \(t \rightarrow \infty \), then \(Y_{\tau _n} \rightarrow Y\) almost surely as \(n \rightarrow \infty \).
We use the phrase “on B, C” (or, equivalently, “C on B”) to mean that property C holds for every \(\omega \in B\).
See, for example, Durrett (2010, Theorem 5.3.2).
This assumption may appear strong at first glance. However, it is stated in “if...then...” form, and it is the hypothesis part of the conditional that is restrictive; a restrictive hypothesis makes the assumption as a whole weak.
See, for example, Loève (1978, p.53).
Since there are no ties in the payoffs, the SPNE of \({\mathscr {G}}_z\) is unique.
The urn analogy is more natural if initial propensities and payoffs are assumed to be integers. Otherwise, one imagines an abstract urn process where balls are perfectly divisible.
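The urn analogy can be sketched as a simple simulation at a single decision node; the action names, payoffs, and initial propensities below are illustrative assumptions, not taken from the paper:

```python
import random

def play(propensities, payoffs, rounds, rng=random.Random(0)):
    """Polya-style urn sketch: draw an action with probability proportional to
    its propensity, then reinforce the chosen action by its payoff (i.e., add
    that many 'balls' for it to the urn)."""
    for _ in range(rounds):
        total = sum(propensities.values())
        r = rng.uniform(0, total)
        for action, weight in propensities.items():
            r -= weight
            if r <= 0:
                propensities[action] += payoffs[action]
                break
    return propensities

# Two actions with equal initial propensities but unequal payoffs.
urn = play({"L": 1.0, "R": 1.0}, {"L": 2.0, "R": 1.0}, 5000)
print(urn)  # propensities accumulated after 5000 reinforced plays
```

With divisible payoffs the “balls” need not be integers, which is the abstract urn process the note describes.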
See, for example, Pemantle (2007), Section 3 and Theorem 3.3 in particular.
References
Beggs A (2005) On the convergence of reinforcement learning. J Econ Theory 122:1–36
Borgers T, Sarin R (1997) Learning through reinforcement and replicator dynamics. J Econ Theory 77:1–14
Durrett R (2010) Probability: theory and examples, 4th edn. Cambridge University Press, Cambridge
Ellison G (1993) Learning, local interaction, and coordination. Econometrica 61:1047–1072
Hopkins E (2002) Two competing models of how people learn in games. Econometrica 70:2141–2166
Jehiel P, Samet D (2005) Learning to play games in extensive form by valuation. J Econ Theory 124:129–148
Laslier J-F, Walliser B (2005) A reinforcement learning process in extensive form games. Int J Game Theory 33:219–227
Laslier J-F, Topol R, Walliser B (2001) A behavioral learning process in games. Games Econ Behav 37:340–366
Loève M (1978) Probability theory, vol II, 4th edn. Springer, Berlin
Marx L, Swinkels J (1997) Order independence for iterated weak dominance. Games Econ Behav 18:219–245
Osborne M, Rubinstein A (1994) A course in game theory. MIT Press, Cambridge
Østerdal L (2005) Iterated weak dominance and subgame dominance. J Math Econ 41:637–645
Pemantle R (2007) A survey of random processes with reinforcement. Probab Surv 4:1–79
Sarin R, Vahid F (1999) Payoff assessments without probabilities: a simple dynamic model of choice. Games Econ Behav 28:294–309
Acknowledgments
An earlier version of this paper benefited greatly from helpful discussions with Chris Shannon. Pak gratefully acknowledges support from NSF grants SES-9710424 and SES-9818759. Xu gratefully acknowledges support from China National Natural Science Foundation grant 71403217.
Pak, M., Xu, B. Generalized reinforcement learning in perfect-information games. Int J Game Theory 45, 985–1011 (2016). https://doi.org/10.1007/s00182-015-0499-1