
1 Introduction

For Bjornsson and Finnsson [5], General Game Playing (GGP) is the area of Artificial Intelligence whose objective is to create intelligent agents that can automatically learn how to play a wide variety of board games, based only on the descriptions of the rules of the games. This implies that, without prior knowledge about the game and while playing, the agent must be able to develop strategies that allow it to win. Since its inception, GGP has made use of methods based on MinMax or Alpha-Beta search trees [11]. This is due to the nature of board games, to which a tree can be associated, where the root node represents the initial state of the game and each child node represents the state of the game after some move has been made. Consequently, the leaf nodes of the game tree correspond to states where the game has ended, so the agent only needs to find a leaf node in which it achieves a win; this is the reason for using methods based on search trees.

Monte Carlo Tree Search (MCTS) is the most popular tree-based search method in GGP, as it has better performance in game trees [6]. MCTS consists of four steps that are repeated cyclically until a stop criterion is met: Selection, Expansion, Simulation, and Back Propagation. The stop criterion can be a limit on the number of simulations, the execution time, or the number of iterations [6].

 

Selection:

In this step the method traverses the tree from the root node until it finds a node that still has children to add to the tree; once this node is found, the Expansion step begins. The route taken in this step is guided by a Selection Policy, which indicates which node should be explored at each level; an example of such a policy is to choose the node with the highest ratio of wins to visits.

Expansion:

In this step, a child node of the node found in the Selection step is added to the tree.

Simulation:

Starting from the state represented by the newly added node, the method simulates playing the game by performing the moves of the players randomly until a result is obtained.

Back Propagation:

In this step, the result of the simulation is propagated to all the nodes visited, updating the number of wins and the number of visits of each node.

 

Once the method ends, the move that the agent must perform is chosen among the children of the root node; it can be the node with the highest number of wins, the node with the highest number of visits, the node that meets both previous criteria, or the node chosen by the selection policy. MCTS has the advantage that it can be used at any moment of the game, since the root of the tree can be any state of the game. Another feature of MCTS is its efficiency, since it does not have to expand the tree completely; it resorts to probability to choose the move that has the highest chance of leading to a win, which is why it is also known as a probabilistic method.
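
To make the four steps concrete, the following Python sketch outlines one possible MCTS loop. It is a minimal illustration of the description above, not the implementation of any particular GGP agent: the game-state interface (legal_moves, apply, is_terminal, result) and the win-ratio selection policy are assumptions chosen for readability, and the final move is picked by visit count, one of the criteria mentioned above.

```python
import random

# Minimal MCTS sketch. The game-state interface (legal_moves, apply,
# is_terminal, result) is an assumption made for this illustration.

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state = state
        self.parent = parent
        self.move = move                      # move that led to this node
        self.children = []
        self.untried = list(state.legal_moves())
        self.wins = 0.0                       # accumulated reward
        self.visits = 0

def select(node):
    """Selection: descend while the node is fully expanded, guided here by
    the ratio of wins to visits (the example selection policy above)."""
    while not node.untried and node.children:
        node = max(node.children, key=lambda c: c.wins / c.visits)
    return node

def expand(node):
    """Expansion: add one child of the selected node to the tree."""
    if node.untried:
        move = node.untried.pop()
        child = Node(node.state.apply(move), parent=node, move=move)
        node.children.append(child)
        return child
    return node                               # terminal node: nothing to add

def simulate(state, player):
    """Simulation: play random moves until the game ends."""
    while not state.is_terminal():
        state = state.apply(random.choice(state.legal_moves()))
    return state.result(player)               # e.g. 1 for a win, 0 otherwise

def backpropagate(node, reward):
    """Back Propagation: update wins and visits of every node visited."""
    while node is not None:
        node.visits += 1
        node.wins += reward
        node = node.parent

def mcts(root_state, player, iterations=1000):
    """Run the four steps cyclically; the stop criterion is an iteration count."""
    root = Node(root_state)
    for _ in range(iterations):
        leaf = expand(select(root))
        backpropagate(leaf, simulate(leaf.state, player))
    # Choose the final move among the children of the root (here: most visits).
    return max(root.children, key=lambda c: c.visits).move
```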

In recent years efforts have been made to improve MCTS, mainly in the Simulation step, where attempts have been made to make the simulation reflect the moves of real adversaries without being completely deterministic. The works of Cazenave [8, 9, 10] stand out here; their idea is to make use of online knowledge by identifying the moves of previous iterations that led to wins, so that they are used with greater probability in future iterations.

Another step of MCTS where improvements have been attempted is the Selection step, specifically the Selection Policy. At the beginning, the average of wins of each node was used as the selection policy; however, new policies have been proposed, such as Upper Confidence Bound. In the Selection step, at each level of the tree, MCTS has to make the following decision: which node should be explored, the one that has obtained the highest number of wins so far, or a less promising node that may turn out to be better in future iterations? This decision is an instance of the Explore-Exploit Dilemma, which Auer et al. [3] describe as the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as frequently as possible.

Another instance of the Explore-Exploit Dilemma is the Multi-Armed Bandit Problem (MABP), which consists of a set of slot machines, each of which has a certain probability of giving a reward. The goal is to maximize the cumulative reward obtained by playing one machine per round over a series of rounds.

An algorithm that decides which machine to activate in each round of the MABP is known as an Activation Policy, of which Upper Confidence Bound (UCB) is the most popular, mainly because it is efficient, simple to implement, and can be used at any time [3, 4]. However, there are other policies that achieve performance close to UCB, such as UCB2, \(\epsilon \)-greedy, UCB-Tuned, UCB-Normal [3], UCB-Improved [4], UCB-V [2], UCB-Minimal [13] and Minimax Optimal Strategy in the Stochastic Case [1].
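
As a concrete illustration of the MABP, the following Python sketch simulates a set of Bernoulli slot machines and plays them for a number of rounds under an arbitrary activation policy. The interface assumed for a policy (a function receiving the accumulated rewards and play counts and returning the index of the machine to activate) is a convention of this sketch, not taken from the cited works; it is reused in the sketches of the following sections.

```python
import random

def play_bandit(probabilities, policy, rounds):
    """Simulate a MABP. `probabilities[i]` is the chance that machine i gives
    a reward of 1; `policy(rewards, plays)` returns the machine to activate."""
    k = len(probabilities)
    rewards = [0.0] * k        # accumulated reward of each machine
    plays = [0] * k            # number of activations of each machine
    for _ in range(rounds):
        j = policy(rewards, plays)
        if random.random() < probabilities[j]:
            rewards[j] += 1.0
        plays[j] += 1
    return rewards, plays
```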

Because the MABP and the decision made by MCTS in the Selection step are both instances of the Explore-Exploit Dilemma, it is possible to use Activation Policies as Selection Policies. In this case, each level of the tree is treated as a MABP where each node is equivalent to a slot machine. This idea was used for the first time by the agent Cadiaplayer, which used UCB as its selection policy, with results so good that the combination of MCTS and UCB, known as Upper Confidence Bound Applied to Trees (UCT), became the state of the art in GGP.

Activation policies like UCB and its variants aim to minimize Cumulative Regret, which is defined as the loss incurred because the policy does not always choose the best machine [3]. However, this approach is not necessarily suitable for MCTS, where the idea is to identify the node that is most likely to lead to a win. Another approach, known as Simple Regret [7, 14], has been proposed and is better suited to MCTS [12]; it is defined as the difference between the expected reward of the optimal machine (the machine with the highest probability of giving a reward) and the expected reward of the machine that has been identified as the optimal one. From this approach emerges UCB\(_{\sqrt{.}}\).
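
For reference, both notions can be written formally (standard definitions, stated here with the notation of Sect. 2: \(\mu ^* = \max _i \mu _i\) is the expectation of the optimal machine, \(T_j(n)\) the number of times machine j has been played after n rounds, and \(J_n\) the machine identified as optimal after n rounds):

$$\begin{aligned} R_n = \mu ^* n - \sum _{j=1}^{K} \mu _j \, \mathrm {E}\left[ T_j(n)\right] , \qquad r_n = \mu ^* - \mu _{J_n} \end{aligned}$$

where \(R_n\) is the cumulative regret minimized by UCB and \(r_n\) is the simple regret minimized by UCB\(_{\sqrt{.}}\).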

This paper presents a comparison between five modifications of UCB and one of UCB\(_{\sqrt{.}}\), with the aim of finding a selection policy that is able to identify the optimal machine as quickly as possible; in MCTS this is equivalent to identifying, at each level of the tree, the node that has the highest chance of leading to a win. The comparison was made on the MABP in two scenarios: the first is the scenario proposed by Auer et al. [3]; in the second, the branching factors of the game trees of different board games are used to generate the sets of machines on which the proposed policies were tested. The results show that certain policies find the optimal machine within the first iterations, although at around 10,000 iterations it is UCB that activates the optimal machine most frequently.

2 Upper Confidence Bound

Auer et al. [3, 4] formally define the MABP by the random variables \(X_{i,n} \in \left\{ 0,1 \right\} \) with \(1 \le i \le K\) and \(n \ge 1\), where each i is the index of a slot machine and K is the number of machines available. By successively activating machine i, the rewards \(X_{i,1},X_{i,2}, \cdots \) are obtained, which are independent and identically distributed according to an unknown law with unknown expectation \(\mu _i\).

UCB is the most widely used policy for the MABP because it achieves regret that grows logarithmically and uniformly in n, does not require information about the reward distributions, and is easy to implement.

UCB consists in the following:

  1. Play each machine once.

  2. Play the machine j that maximizes \(\bar{x}_{j}+\sqrt{\frac{2\ln n}{n_{j}}}\), where \(\bar{x}_{j}\) is the average reward obtained from machine j, \(n_j\) is the number of times that machine j has been played, and n is the total number of plays done so far.

  3. Repeat the previous step until a certain number of rounds is reached.
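
These steps translate directly into the bandit-loop interface sketched in Sect. 1; the following is a minimal sketch under that assumed interface, not a reference implementation.

```python
import math

def ucb(rewards, plays):
    """UCB activation policy: play each machine once, then play the machine
    that maximizes the average reward plus sqrt(2 ln n / n_j)."""
    for j, n_j in enumerate(plays):
        if n_j == 0:
            return j                       # step 1: play each machine once
    n = sum(plays)                         # total number of plays so far
    return max(range(len(plays)),
               key=lambda j: rewards[j] / plays[j]
               + math.sqrt(2.0 * math.log(n) / plays[j]))
```

For example, `play_bandit([0.9, 0.6], ucb, 1000)` plays a two-machine instance (with arbitrary probabilities) for 1,000 rounds using UCB.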

3 Upper Confidence Bound\(_{\sqrt{.}}\)

Proposed by Tolpin and Shimony [14], the UCB\(_{\sqrt{.}}\) policy is intended for use in MCTS and focuses on minimizing simple regret. It consists of the following:

  1. Play each machine once.

  2. Play the machine j that maximizes

    $$\begin{aligned} \bar{x}_{j}+\sqrt{\frac{c\sqrt{n}}{n_{j}}} \end{aligned}$$
    (1)

    where \(\bar{x}_{j}\) is the average reward obtained from machine j, \(n_j\) is the number of times that machine j has been played, n is the total number of plays done so far, and c is a constant.

  3. Repeat the previous step until a certain number of rounds is reached.
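
Under the same assumed interface, UCB\(_{\sqrt{.}}\) only changes the exploration term. Since the value of c is left open above, the default used below is an arbitrary choice for illustration.

```python
import math

def ucb_sqrt(rewards, plays, c=2.0):
    """UCB_sqrt policy: average reward plus sqrt(c * sqrt(n) / n_j).
    The constant c is a free parameter; 2.0 is an arbitrary default."""
    for j, n_j in enumerate(plays):
        if n_j == 0:
            return j                       # step 1: play each machine once
    n = sum(plays)                         # total number of plays so far
    return max(range(len(plays)),
               key=lambda j: rewards[j] / plays[j]
               + math.sqrt(c * math.sqrt(n) / plays[j]))
```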

4 Proposed Policies

UCB is the most used policy for the MABP and, consequently, in Monte Carlo Tree Search. In this section five modifications of this policy are presented, together with one modification of UCB\(_{\sqrt{.}}\):

$$\begin{aligned} \textit{UCB-A}=\overline{x}_{j}+\sqrt{\frac{2\log n}{n}} \end{aligned}$$
(2)
$$\begin{aligned} \textit{UCB-B}=\overline{x}_{j}+\sqrt{\frac{2\log n_{j}}{n_{j}}} \end{aligned}$$
(3)
$$\begin{aligned} \textit{UCB-C}=\overline{x}_{j}+\sqrt{\frac{2\log n_{j}}{n}} \end{aligned}$$
(4)
$$\begin{aligned} \textit{UCB-D}=\overline{x}_{j} \end{aligned}$$
(5)
$$\begin{aligned} \textit{UCB-E}=\overline{x}_{j}+\frac{n_{j}}{n} \end{aligned}$$
(6)

The UCB-A policy makes use only of the total number of machine activations (the number of simulations of the parent node in MCTS). The UCB-B policy makes use only of the number of activations of the machine itself (the number of simulations of the child node). The UCB-C policy is similar to UCB but with \(n_j\) and n exchanged. The UCB-D policy only takes the average of the rewards obtained from the machine (the average of wins per node in MCTS), which means that this policy performs exploitation only. The UCB-E policy adds to this average the proportion of plays made on the machine. Finally, UCB-F is a modification of the UCB\(_{\sqrt{.}}\) policy with \(n_j\) and n exchanged.

$$\begin{aligned} \textit{UCB-F}=\overline{x}_{j}+\sqrt{\frac{2\sqrt{n_{j}}}{n}} \end{aligned}$$
(7)
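
Under the same interface as the earlier sketches, the six proposed indices of Eqs. (2)–(7) can be written as one-line score functions. As in those sketches, each policy is assumed to start, like UCB, by playing every machine once.

```python
import math

def make_policy(score):
    """Wrap a score(avg, n_j, n) function into an activation policy that first
    plays every machine once and then maximizes the score."""
    def policy(rewards, plays):
        for j, n_j in enumerate(plays):
            if n_j == 0:
                return j
        n = sum(plays)
        return max(range(len(plays)),
                   key=lambda j: score(rewards[j] / plays[j], plays[j], n))
    return policy

ucb_a = make_policy(lambda avg, n_j, n: avg + math.sqrt(2 * math.log(n) / n))      # Eq. (2)
ucb_b = make_policy(lambda avg, n_j, n: avg + math.sqrt(2 * math.log(n_j) / n_j))  # Eq. (3)
ucb_c = make_policy(lambda avg, n_j, n: avg + math.sqrt(2 * math.log(n_j) / n))    # Eq. (4)
ucb_d = make_policy(lambda avg, n_j, n: avg)                                       # Eq. (5)
ucb_e = make_policy(lambda avg, n_j, n: avg + n_j / n)                             # Eq. (6)
ucb_f = make_policy(lambda avg, n_j, n: avg + math.sqrt(2 * math.sqrt(n_j) / n))   # Eq. (7)
```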

5 Comparison of Policies

In this section we compare the performance of the proposed policies with respect to UCB and UCB\(_{\sqrt{.}}\). Specifically, we measure how often each policy chooses the optimal machine; this measure is used because in the MCTS setting it is equivalent to choosing the child node with the highest number of wins.

The policies were compared on the MABP in two scenarios: the first is the one proposed by Auer et al. [3], and in the second the branching factors of a set of board games are used.

5.1 First Scenario

This scenario is the one proposed by Auer et al. [3] to test the policies UCB, UCB-Tuned, UCB2, UCB-Normal and \(\epsilon \)-greedy. Auer et al. propose that the policies should be tested on 7 sets of machines; Table 1 shows these sets with the probability of giving a reward of each of their machines.

Table 1. Sets of slot machines

For Auer et al., sets A and D are easy sets because the reward of the optimal machine has low variance and the difference between the expected value of the optimal machine and that of the suboptimal machines is wide. Sets C and G are hard sets because the reward of the optimal machine has high variance and the difference between the expected value of the optimal machine and that of the suboptimal machines is small.

The policies were compared with the following conditions:

  • Each of the sets proposed by Auer et al. was used.

  • Each policy was tested 100 times on each set, and the average rate of activation of the optimal machine was obtained.

  • The policies were limited to 100,000 rounds (a sketch of this setup is given below).
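
Under these conditions, the first scenario can be reproduced with a driver like the one below. It is a sketch that reuses the `play_bandit` loop and the policies sketched earlier; the probabilities of each set are those of Table 1 and are not repeated here.

```python
def percentage_optimal(probabilities, policy, rounds=100000, repetitions=100):
    """Average percentage of plays given to the optimal machine, over several
    independent repetitions of the same policy on the same set of machines."""
    optimal = max(range(len(probabilities)), key=lambda j: probabilities[j])
    share = 0.0
    for _ in range(repetitions):
        _, plays = play_bandit(probabilities, policy, rounds)
        share += plays[optimal] / rounds
    return 100.0 * share / repetitions
```

For example, `percentage_optimal(set_a, ucb)` estimates how often UCB activates the optimal machine of set A, where `set_a` is a hypothetical variable holding the probabilities of set A from Table 1.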

Results. Figures 1, 2, 3 and 4 show the results obtained in each set, and Table 2 shows the average percentage of plays of the optimal machine. We can note that at 100,000 rounds UCB is the best policy, since it activates the optimal machine over \(95.8\%\) of the time on average; the same happens at 10,000 rounds, where it activates the optimal machine \(80.6\%\) of the time. However, at lower numbers of rounds UCB-A and UCB-B are the policies that activate the optimal machine most frequently: over \(71\%\) at 1,000 rounds and over \(57\%\) at 100 rounds. It is worth highlighting that UCB-B has a performance similar to that of UCB, and it is the second-best policy at 100,000 and 10,000 rounds. The other policies perform below UCB, UCB-A, UCB-B and UCB-D, and UCB-E is the worst policy, since it only reaches \(37\%\) of activations of the optimal machine. From the figures we can note that in the first rounds UCB is dedicated to exploration in order to find the optimal machine without underestimating any suboptimal machine; in these same rounds UCB-A and UCB-D are the policies that most quickly activate the optimal machine in all sets except set C. However, both policies tend to stagnate after 1,000 rounds and do not overtake UCB.

Fig. 1. Activations of optimal machine in sets A (left) and B (right)

Fig. 2. Activations of optimal machine in sets C (left) and D (right)

Fig. 3. Activations of optimal machine in sets E (left) and F (right)

Fig. 4. Activations of optimal machine in set G

Table 2. Activation percentage of optimal machine

5.2 Second Scenario

In this scenario the proposed policies were again tested on the MABP; however, the number of machines was given by the branching factor of the games shown in Table 3.

Table 3. Branching factor of games

The branching factor is used because this is the number of machines the policies would face if implemented in Monte Carlo Tree Search.

For this scenario the following conditions were used (a sketch of the setup is given after the list):

  • The branching factor of each game is used as the number of machines.

  • For each branching factor, five sets of machines with random probabilities were created.

  • Each policy was tested 100 times for 10,000 rounds on these sets, and the results were averaged.

  • The average number of activations of the optimal machine was obtained.
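
The machine sets of this scenario can be generated with a sketch like the following; drawing each probability uniformly with `random.random()` is an assumption, since the text only states that the probabilities are random.

```python
import random

def random_machine_sets(branching_factor, sets=5):
    """Create `sets` MABP instances with `branching_factor` machines each,
    every machine having a random probability of giving a reward."""
    return [[random.random() for _ in range(branching_factor)]
            for _ in range(sets)]
```

Each generated set can then be fed to `percentage_optimal` with `rounds=10000` to obtain figures comparable to those reported in Table 4.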

Results. From Table 4 we can note that when the branching factor is under 11, UCB-B is the policy with the best performance, since it reaches between \(74\%\) and \(83\%\) of activations of the optimal machine, except for a branching factor of 4, where UCB is the best policy. For the remaining branching factors, UCB-A and UCB-B have the best performance, since they are the policies that activate the optimal machine most frequently. From Table 5 we can note that in all games the best policies are UCB-A and UCB-B, given that on average they activate the optimal machine most frequently. From Table 6 we can note that the behavior of the policies changes, with UCB-A and UCB-D being the best policies when the branching factor is under 11. Surprisingly, in Table 6, for the branching factors 29, 35, 38 and 40, UCB-F has the best performance.

Table 4. Activation percentage of optimal machine in games at 10,000 rounds
Table 5. Activation percentage of optimal machine in games at 1,000 rounds
Table 6. Activation percentage of optimal machine in games at 100 rounds

6 Conclusions and Future Work

From the first scenario we could note that UCB is the policy with the best performance, since it activates the optimal machine over \(80\%\) of the time after 10,000 rounds. UCB-B had a performance similar to UCB but did not reach its percentage. In this scenario we could also note that UCB-A and UCB-B are the policies that activate the optimal machine the earliest; however, in the later rounds they are outperformed by UCB. This behavior was repeated when we used sets of machines based on the branching factors of games, where we could note that the difference in performance with respect to UCB decreased as the number of rounds increased; probably, with more rounds, UCB could outperform the other policies.

Because UCB-A and UCB-D are policies that rely only on exploitation, and given the results obtained, we can conclude that with a low number of rounds (below 10,000) it is better to use exploitation policies, whereas with a high number of rounds it is better to use UCB. However, these policies still need to be applied in MCTS and GGP in order to observe their real behavior. In both scenarios UCB\(_{\sqrt{.}}\) had the worst performance; this may be due to a poor choice of the value of its constant, so we leave as future work tuning this value and comparing its performance with the exploitation policies.