A kernel based learning method for non-stationary two-player repeated games

https://doi.org/10.1016/j.knosys.2020.105820

Highlights

  • String kernel density estimation, capable of predicting the opponent’s actions.

  • Kernel functions with the positive-definite property.

  • Long-term memory can be stored in an R-way trie data structure.

  • Linear computational complexity for calculating the kernel over memory sequences.

Abstract

Repeated games are a branch of game theory in which a base game is played several times by the players involved. In this setting, players may not always play the optimal strategy, or they may be willing to engage in collaboration or other types of behavior that might lead to a higher long-term profit. Since the same game is repeated for several rounds, and considering a scenario with complete information, it is possible for a player to analyze its opponent’s behavior in order to find patterns. These patterns can then be used to predict the opponent’s actions. Such a setting, where players have mutual information about past moves and do not always play in equilibrium, leads naturally to non-stationary environments, in which the players can frequently modify their strategies in order to get ahead in the game. In this work, we propose a novel algorithm based on string kernel density estimation, which is capable of predicting the opponent’s actions in repeated games and can be used to optimize the player’s profit over time. The prediction is not limited to the next round: it can also cover a finite sequence of future rounds, which can be combined with a lookahead search scheme of limited depth. In the experiments section, it is shown that the proposed algorithm is able to learn and adapt rapidly, providing good results even when the opponent also adopts an adaptive strategy.

Introduction

Prediction algorithms are usually based on the assumption of stationarity with respect to time, where a learning agent adapts its policy to a static environment. This contrasts with the setting usually found in multi-agent systems, where each agent is able to adapt its strategy based on the environment. In such settings, the environment changes as the learning agents change their strategies, since the actions taken by each agent can affect the environment and, consequently, the future behavior of the other agents. In order to be successful, a learning agent’s model of the environment is expected to change with time, often without prior additional information about the other agents. This scenario is frequently seen in competitive environments where each agent seeks to maximize its own reward, which also depends on the actions of the other agents.

One such competitive scenario is that of repeated games, which consist of a base game played multiple times by a set of players. The base game can be a simple one, ending after just a few actions. One common example is the popular game of “Rock, Paper and Scissors”, which can be taken simply as a game of chance. However, when the game is played repeatedly against the same opponents and there is full information about the actions played at each round, it may be possible to find patterns that can be used to predict the opponent’s next move.

An optimal strategy for a game can be found in the Nash equilibrium [1], which occurs when no player has an incentive to change its current strategy as long as the other players are not changing theirs. That is, a unilateral change by any of the players would result in a loss of profit. The Nash equilibrium for the game of Rock, Paper and Scissors is the strategy of playing each action with the same probability. The expected outcome of this strategy is the same number of wins, defeats and ties against any strategy that the opponent might be using. However, if the opponent is not playing the Nash equilibrium, a learning agent might take advantage of this and change its strategy in order to increase its profit. This is often the case when these games are played against human opponents, since they are known not to play in Nash equilibrium (see [2] and references therein). In addition, there are games where the players may mutually benefit if they do not play at equilibrium. Since the game is repeated, it may be possible to detect the intention of an adversary to engage in such types of behavior, which might lead to a higher overall profit.
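
As a concrete illustration of this equilibrium property, the short sketch below (not part of the original paper) checks that the uniform strategy for Rock, Paper and Scissors yields a zero expected payoff against any opponent strategy; the payoff matrix encoding and the variable names are assumptions made for the example.

```python
import numpy as np

# Row player's payoff matrix for Rock, Paper, Scissors
# (win = +1, tie = 0, loss = -1); rows and columns ordered R, P, S.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

uniform = np.ones(3) / 3          # the Nash equilibrium strategy

# Expected payoff of the uniform strategy against a few opponent strategies.
for opponent in (np.array([1.0, 0.0, 0.0]),   # always Rock
                 np.array([0.2, 0.5, 0.3])):  # an arbitrary mixed strategy
    value = uniform @ PAYOFF @ opponent
    print(f"expected payoff vs {opponent}: {value:+.3f}")  # always ~0
```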

Stationary strategies have been studied in the context of repeated games with simultaneous play. For example, the Tit-for-Tat strategy, where the player chooses its next action based on the opponent’s last action, has been shown both empirically [3] and theoretically [4] to be a good strategy for the “Prisoner’s Dilemma”, a game where each player can choose whether to cooperate with the other player or defect. The disadvantage of such approaches is that they are usually tailored to specific games and cannot adapt their strategy according to the opponent’s behavior.

Online learning in competitive scenarios has been considered before through different means. One of the most well-known methods is Fictitious Play [5], which keeps a record of the opponent’s past moves and takes its next action based on the frequency with which each move was played. Fictitious Play assumes that the opponent has a stationary, non-adaptive strategy and that the players make their moves sequentially. This strategy was shown to converge to the Nash equilibrium, however with a poor convergence rate that scales exponentially with the number of rounds [6], [7], [8]. More recent online learning approaches are usually based on regret minimization and aim at approximating the Nash equilibrium [8], [9], [10]. These works assume a stationary environment, which implies an oblivious adversary, and are therefore not suitable for the setting considered here. In particular, the definitions of regret used in these papers are based on those defined for oblivious bandit problems and, therefore, do not account for adaptive opponents [11]. This is also discussed by [12], where the author proposed a modified regret measure, called disappointment, which can be understood as an extension of the policy regret measure presented by [11] and [13]. Although achieving no disappointment is impossible, [12] proposes a meta-algorithm that achieves near-optimal disappointment by exploring a set of expert strategies. While this approach is interesting, it relies on the existence and the correct choice of these expert strategies for each game considered.
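
A minimal sketch of the Fictitious Play idea described above, under the assumption that the game is given in normal form by a payoff matrix; the function name and the tie-breaking rule are illustrative choices, not taken from [5].

```python
from collections import Counter
import numpy as np

def fictitious_play_move(opponent_history, payoff_matrix, n_actions):
    """Best response to the empirical frequency of the opponent's past moves."""
    counts = Counter(opponent_history)
    freq = np.array([counts[a] for a in range(n_actions)], dtype=float)
    freq = freq / freq.sum() if freq.sum() > 0 else np.ones(n_actions) / n_actions
    expected = payoff_matrix @ freq      # expected payoff of each of our actions
    return int(np.argmax(expected))      # play the action with the highest expected payoff

# Example: Rock, Paper, Scissors against an opponent that has mostly played Rock (0).
PAYOFF = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(fictitious_play_move([0, 0, 0, 1, 0], PAYOFF, 3))   # -> 1 (Paper)
```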

There are also algorithms based on reinforcement learning for repeated games. One of the most well-known algorithms in this class is the so-called WoLF, proposed by [14]. The algorithm uses the principle “win or learn fast” (which gives it its acronym), whose objective is to learn quickly when losing and more slowly when winning. WoLF uses gradient ascent to move over the space of strategies, computing the expected payoffs and changing its strategy in order to increase these values. In order to control convergence and learning speed, it uses an adaptive learning rate following the WoLF principle. The algorithm starts playing at the Nash equilibrium and, while it guarantees convergence to the optimal strategy against certain opponents, it requires millions of rounds before converging to the optimal policy. Therefore, it is not a good algorithm for non-stationary environments where the opponent may change its strategy every few rounds. [15] provide an extensive comparison of reinforcement learning algorithms applied to repeated games and present a new algorithm called M-Qubed. Similarly to previous approaches, its state representation accounts for the history of previous moves and, therefore, the number of possible states grows exponentially with the length of the history and the number of actions, which may lead to problems related to the curse of dimensionality. In another reinforcement learning approach for non-stationary opponents in repeated games, the authors of [16] proposed a learning algorithm based on a drift exploration, named R-max#, which is coupled with a switch detection mechanism used to detect when the opponent changes its strategy. The R-max# algorithm is a variation of R-max [17], a well-known model-based algorithm for reinforcement learning. The state representation for this algorithm is assumed to be provided by an expert and is illustrated, in the numerical section, using the history of past moves. Therefore, this approach also suffers from the curse of dimensionality when the state accounts for long histories or games with a large number of possible actions.
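
The following fragment illustrates the WoLF principle on a single mixed strategy. It is a simplified stand-in for the full variable-learning-rate gradient ascent of [14]: the “winning” test, the two learning rates and the crude projection back onto the simplex are assumptions made for the sake of a short example.

```python
import numpy as np

def wolf_step(strategy, avg_strategy, payoff_matrix, opponent_strategy,
              lr_win=0.01, lr_lose=0.04):
    """One simplified WoLF-style update: learn slowly when 'winning'
    (the current strategy does better than the historical average against
    the opponent's current strategy), and quickly otherwise."""
    gradient = payoff_matrix @ opponent_strategy        # d(expected payoff)/d(strategy)
    current_value = strategy @ gradient
    average_value = avg_strategy @ gradient
    lr = lr_win if current_value > average_value else lr_lose
    new_strategy = strategy + lr * gradient             # gradient ascent step
    new_strategy = np.clip(new_strategy, 1e-12, None)   # crude projection back
    return new_strategy / new_strategy.sum()            # onto the probability simplex
```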

Another approach for non-stationary repeated games is through adaptive strategies, which rely on probabilistic opponent modeling. The Entropy Learned Pruned Hypothesis Space (ELPH) algorithm [18], [19] is a method based on a statistical learning approach. It tries to predict the opponent’s behavior in order to increase its immediate payoff in repeated games. It works with a hypothesis space which is incremented after each round using the last s moves performed by the opponent. Each hypothesis is composed of a subsequence pattern of these s past moves. The frequency of the opponent’s next move following each pattern is used to predict its behavior. Hence, if A is the set of different feasible moves associated with the game, the hypothesis space can contain a total of (|A|+1)^s different hypotheses. Each memory sequence of size s generates 2^s − 1 plausible hypotheses (the non-empty elements of its power set) that must be added to the hypothesis space at each round, making the memory required to run this algorithm grow exponentially with respect to s. This also increases the computational cost involved in adding and searching hypotheses. This is a major problem, since having a sufficient memory size is crucial to obtaining optimal results, even against opponents that are simple automata [20]. In order to deal with this exponential growth, a pruning strategy is used, which removes from the hypothesis space the patterns whose entropy levels are higher than a given threshold.
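
To make the exponential growth discussed above concrete, the sketch below enumerates the 2^s − 1 non-empty subsequence patterns of the last s opponent moves and prunes hypotheses whose entropy exceeds a threshold. It follows the description of ELPH given here, but the data structures, function names and threshold value are our own simplifications rather than the original implementation.

```python
from itertools import combinations
from collections import defaultdict, Counter
import math

def patterns(window):
    """All 2^s - 1 non-empty subsequence patterns of the last s moves,
    keyed by the relative positions they occupy."""
    s = len(window)
    for r in range(1, s + 1):
        for idx in combinations(range(s), r):
            yield tuple((i, window[i]) for i in idx)

def entropy(counter):
    """Shannon entropy of the next-move frequencies of one hypothesis."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

# hypothesis space: pattern -> frequency of the opponent's next move
hypotheses = defaultdict(Counter)

def update(window, next_move):
    """Add all patterns of the current window, recording the observed next move."""
    for p in patterns(window):
        hypotheses[p][next_move] += 1

def prune(threshold=1.0):
    """Remove patterns whose entropy exceeds the given threshold."""
    for p in [p for p, c in hypotheses.items() if entropy(c) > threshold]:
        del hypotheses[p]
```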

Reinforcement learning algorithms, such as Q-Learning, have been integrated with the ELPH algorithm and other prediction algorithms in repeated games [20]. The motivation for this is to consider the possible effects of the current action on future moves. Although thousands of rounds were necessary during the training stage before the opponents could be modeled correctly [20], it was shown that lookahead procedures such as these can help when playing against opponents with memory. In a related work, [21] developed a comprehensive study of the human model of sequential decision making using the Rock, Paper and Scissors game. In this sense, comparisons based on computational experiments were made to investigate the influence of observational and reinforcement accounts on human behavior. For the observational interpretation of the learning process, the ELPH algorithm was employed to predict the behavior of the patients. For the reinforcement interpretation, the authors developed a non-stationary sequence learning model based on a reinforcement learning variation of the ELPH algorithm, named RELPH (Reinforcement and Entropy Learned Pruned Hypothesis Space). The results suggest that humans employ a sub-optimal reinforcement-based learning strategy rather than an objective statistical learning approach.

In this paper, we present the String Kernel Density Estimation (SKDE) algorithm for non-stationary repeated games. Similarly to the ELPH algorithm, the SKDE algorithm uses the history of previous actions in order to construct a probabilistic model of the opponent. However, instead of generating subsequence patterns, the SKDE algorithm is based on a string kernel which was specially designed for repeated games and can be computed in linear time. To the best of our knowledge, this string kernel is new, and it is the first time that such an approach is used in repeated games. The string kernel measures how the current short-term memory compares with past game sequences. This helps aggregate information from similar sequences of past moves and simplifies the state representation. The string kernel is used to compose an estimate of the probability that the opponent will play each move in the next round. Using this probability estimate, the algorithm chooses to play the action that gives the highest expected payoff. We show in the experiments section that the performance of SKDE is comparable to that of the ELPH algorithm when playing against non-static strategies, while consuming less memory and being more computationally efficient. In addition, the SKDE algorithm is able to adapt more quickly to changes in the opponent’s strategy.
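
As a small illustration of the decision rule mentioned above, the fragment below selects the action with the highest expected payoff under an estimated distribution over the opponent’s next move; the probability vector is a placeholder for the output of the string kernel density estimate developed in the following sections.

```python
import numpy as np

def best_response(payoff_matrix, opponent_probs):
    """Choose the row action maximizing expected payoff given a predicted
    distribution over the opponent's next move (columns)."""
    expected = payoff_matrix @ opponent_probs
    return int(np.argmax(expected))

# Rock, Paper, Scissors: if the estimate says Rock (0) is most likely...
PAYOFF = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(best_response(PAYOFF, np.array([0.6, 0.3, 0.1])))   # -> 1 (Paper)
```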

The remainder of this paper is organized as follows: in Section 2, preliminary concepts related to repeated games and string kernels are presented. In Section 3, the String Kernel Density Estimation algorithm is presented, along with a discussion of its computational complexity in time and space. In Section 4, we present the string kernel used in SKDE in more detail. We also discuss a data structure that can improve the performance of the algorithm. In Section 5, some extensions to the SKDE algorithm are presented, such as the lookahead procedure, the memory window and the kernel weight adaptation, which can be used to increase its performance in certain game scenarios. In Section 6, numerical experiments are performed to measure the effectiveness of the proposed algorithm against opponents with non-stationary strategies. Finally, in Section 7, the conclusions are reported.

Section snippets

Preliminary concepts

In this section, we review some preliminary concepts and define the notation that will be used throughout the paper. We begin by defining the type of repeated games considered here and related strategies. Next, string kernels are briefly explained alongside their importance.

Probability density estimation

The goal of the proposed algorithm is to find any repeating pattern in the opponent’s strategy that can be exploited in order to increase the agent’s payoff. This is done in real time. After each round of the repeated game, the player’s memory is updated and it is used to predict the probabilities that each move will be played by the opponent in the next round. These probabilities are given by a kernel which passes over the player’s memory to find the associated probability density function.
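
A minimal sketch of this idea, assuming a generic similarity kernel between two memory windows of size s: each past window is weighted by its similarity to the current short-term memory, and the weights are accumulated on the move the opponent played immediately after that window. The function and variable names are ours; the actual kernel used by SKDE is presented in Section 4.

```python
from collections import defaultdict

def estimate_next_move_probs(history, s, kernel):
    """history: list of the opponent's past moves (oldest first).
    Returns a dict mapping each move to its estimated probability."""
    current = tuple(history[-s:])                 # short-term memory
    weights = defaultdict(float)
    for t in range(len(history) - s):             # slide over the long-term memory
        past_window = tuple(history[t:t + s])
        weights[history[t + s]] += kernel(current, past_window)
    total = sum(weights.values())
    if total == 0:                                # no similar pattern seen yet
        moves = set(history)
        return {a: 1.0 / len(moves) for a in moves}
    return {a: w / total for a, w in weights.items()}
```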

String kernel

Given two short-term memories m and m′ of size s, the kernel compares their similarity. The comparison is made between every pair of substrings of size n ≤ s of consecutive moves that start and end at the same relative positions. For example, the substring of the first three elements of m is only compared with the substring of the first three elements of m′, while the subsequence of m starting at its second position and ending at its fifth position will only be compared with the subsequence of m′
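
One possible realization of a position-aligned kernel consistent with this description is sketched below: it counts the same-position substrings of consecutive moves on which m and m′ agree, which reduces to summing L(L+1)/2 over maximal runs of L matching positions and can therefore be computed in a single linear pass. Any per-length weighting used by the actual kernel of the paper is omitted here.

```python
def aligned_substring_kernel(m1, m2):
    """Count the same-position substrings of consecutive moves on which the two
    short-term memories agree.  A maximal run of L matching positions
    contributes L*(L+1)//2 matching substrings; the scan is O(s)."""
    assert len(m1) == len(m2)
    total, run = 0, 0
    for a, b in zip(m1, m2):
        run = run + 1 if a == b else 0
        total += run                 # 'run' new matching substrings end here
    return total

# Example: the memories agree on positions 0-1 and 3 -> (2*3)//2 + 1 = 4
print(aligned_substring_kernel("RPSR", "RPRR"))   # -> 4
```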

Advanced procedures

In this section, we consider some extensions of the proposed algorithm. We begin by discussing the use of the player’s own moves in the prediction process. This can be useful against adaptive or learning opponents. Next, we introduce a lookahead search for possible future moves by the opponent. Then, we discuss the introduction of a memory window in order to limit the long-term memory sequences to the most recent ones. This helps the algorithm to adapt to abrupt changes in the opponent’s
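
As a small illustration of the memory window, the sketch below keeps the long-term memory in a fixed-length FIFO buffer so that only the most recent rounds influence the estimate; the window length and the decision to store the player’s own moves alongside the opponent’s are assumptions for the example.

```python
from collections import deque

WINDOW = 200                               # keep only the most recent rounds
long_term_memory = deque(maxlen=WINDOW)    # older entries are discarded automatically

def record_round(opponent_move, own_move=None):
    """Store the latest round; own moves may be included when playing
    against adaptive opponents, as discussed above."""
    long_term_memory.append((opponent_move, own_move))
```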

Experiments and results

In order to test the effectiveness of the SKDE algorithm, a set of experiments was carried out. The first set of experiments was designed to verify how the algorithm behaves in a game where the best results are attained through cooperation.

Therefore, for this case, we used the Prisoner’s Dilemma game and the 20 automata strategies presented by Piccolo and Squillero [29] for the opponents. Since these strategies are automata, the SKDE algorithm only used the history of its opponent’s actions in
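
For reference, a conventional Prisoner’s Dilemma payoff matrix of the kind used in such experiments is shown below, with the standard values T=5, R=3, P=1, S=0; the exact payoffs used in the paper’s experiments may differ.

```python
# (row player's payoff, column player's payoff); actions: C = cooperate, D = defect
PRISONERS_DILEMMA = {
    ("C", "C"): (3, 3),   # mutual cooperation (reward R)
    ("C", "D"): (0, 5),   # sucker's payoff S vs temptation T
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),   # mutual defection (punishment P)
}
```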

Conclusions and future work

Repeated Games demand different approaches from the classical methods, such as the min–max algorithm or an equilibrium strategy, in order to learn effective strategies when playing against sub-optimal or adaptive opponents. In this scenario, a learning agent must be able to understand its opponent and to adapt quickly to a changing environment. In this paper, we presented a new algorithm for repeated games capable of adapting its strategy in order to increase its expected payoff. The algorithm

CRediT authorship contribution statement

Renan Motta Goulart: Data curation, Writing - original draft, Software, Validation. Saul C. Leite: Conceptualization, Methodology, Formal analysis, Writing - review & editing. Raul Fonseca Neto: Conceptualization, Methodology, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

References (29)

  • Daskalakis, C., et al., Near-optimal no-regret algorithms for zero-sum games, Games Econ. Behav. (2015)
  • Bowling, M., et al., Multiagent learning using a variable learning rate, Artificial Intelligence (2002)
  • Nash, J.F., Non-cooperative games, Ann. Math. (1951)
  • Wright, J.R., et al., Level-0 models for predicting human behavior in games, J. Artificial Intelligence Res. (2019)
  • Axelrod, R., The Evolution of Cooperation (1984)
  • Littman, M.L., et al., A polynomial-time Nash equilibrium algorithm for repeated games
  • Brown, G.W., Iterative solutions of games by fictitious play
  • Brandt, A., et al., From external to internal regret, J. Mach. Learn. Res. (2007)
  • Robinson, J., An iterative method of solving a game, Ann. Math. (1951)
  • Zinkevich, M., et al., Regret minimization in games with incomplete information
  • Johanson, M., et al., Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization
  • Arora, R., Dekel, O., Tewari, A., Online bandit learning against an adaptive adversary: From regret to policy regret, in: ...
  • Crandall, J.W., Towards minimizing disappointment in repeated games, J. Artificial Intelligence Res. (2014)
  • Cesa-Bianchi, N., et al., Online learning with switching costs and other adaptive adversaries
