1 Introduction

In the RoboCup soccer simulation 2D league, player agents make a decision at each cycle in real time. A game consists of 6000 cycles (excluding set-play periods), so the decision making process of each player is executed about 6000 times. The performance of a team therefore depends heavily on the decision process of its agents. Akiyama et al. [1] implemented a tree search algorithm for the decision making process in the RoboCup soccer simulation. During the tree search, each node (which corresponds to an action) is assessed by an evaluation function, and the node with the highest value is selected as the action to take. Evaluation functions are commonly tuned by hand: for example, adding points if the ball is close to the goal, or deducting points if the probability of an opponent's interception is high. Akiyama and Nakashima [2] also showed that using an evaluation function including such rules gives higher team performance than using a simple function with no rules. However, tuning such an evaluation function is laborious and most of the time yields sub-optimal results. In addition, since there is no perfect strategy, it is difficult to win against all teams with a single game plan. As a result, when implementing a team, it is common to define various strategies. In this case, each strategy might require its own evaluation function. Therefore, it is desirable to have an automatic method to tune them.

In a previous work, we tried to solve this task by using a four-layered neural network [3]. Evaluation functions could be tuned successfully by supervised learning with a training set extracted from an expert team's behavior. However, this did not improve the performance of the team. The cause was the strong relationship between evaluation functions and action candidates: action candidates are generated around the kicker or the receivers, so the team formation must correspond to the expert players' positioning for the learned evaluation function to be effective.

The aim of this paper is to solve this problem by proposing a method that makes the team mimic an expert team known to play soccer well. If player agents imitate the experts' behaviors, the team should be able to win against opponents it could not defeat with a simple or hand-coded evaluator. In this work, we model the expert's decision making process with a neural network. The neural network evaluates the action for the next cycle and is trained by supervised learning. In the experiments, we evaluate the performance of the team using the modeled decision making process by counting the number of times the ball enters a target area, the number of scored goals, and the number of successful through passes. The proposed method is compared with a team using a simple or hand-coded evaluation function. Moreover, we investigate whether all players should use the same evaluation function or not.

2 Related Work

The recent advances in deep learning have allowed the design of successful methods in various control domains by using either supervised learning or reinforcement learning. For example, Warnell et al. [4] proposed a method that uses the representational power of deep neural networks to learn complex tasks, such as the Atari game BOWLING, in a short amount of time with a human trainer. Stanescu et al. [5] presented a deep convolutional neural network to evaluate states in real-time strategy games. Silver et al. [6, 7] used deep neural networks to evaluate board positions and to select moves in the game of Go. In the case of soccer games, Hong et al. [8] proposed a deep policy inference Q-network that targets multi-agent systems. Their model is evaluated in a simulated soccer game whose field is a grid world. In the RoboCup environment, especially in the soccer simulation league, it is difficult to train deep neural networks to evaluate actions because the soccer field is a continuous environment. Therefore, supervised learning and reinforcement learning have been applied to simple experimental settings [9, 10], such as "one on one" or "keepaway". Deep learning methods are also used for offline game analysis [11]. These studies aimed to improve not team performance but a single player's behavior or decision making. Therefore, it would be difficult to apply these approaches to multi-agent systems.

On the other hand, defeating opponents in a soccer game requires improving not an individual policy but team strategies. In the soccer simulation 2D league, teams implement various strategies to win the competition, and it has become difficult to win against all teams with a single strategy. For these reasons, in a previous work, we proposed a model that determines the best player formation for corner-kick situations in order to switch strategies [12]. Moreover, we proposed a model that identifies the opponent's defensive strategies in an online manner [13].

Floyd et al. [14] proposed a case-based reasoning approach to imitate player agents in terms of action selection. Their approach focused on imitating low-level actions (i.e., dash, kick, turn). In this paper, in order to create new strategies, we approximate the evaluation function of an expert team to score high-level actions (i.e., pass, dribble, shoot) by using a deep/shallow neural network. Unlike [9] and [10], the neural networks can learn strategies because the training data consist of kick sequences. Moreover, the aim of our method is not to improve an individual behavior but to mimic team strategies.

3 Action Selection

A cooperative action planning method based on tree search [1] is employed to model the players' decision making process. In this model, an action plan is created by generating and exploring a decision tree at the time of kicking the ball. Nodes of the tree correspond to situations of the soccer field, and edges correspond to the actions that players take. An evaluation value is assigned to each node. The action plan is defined as a fixed-length action sequence that the player should perform from the next cycle. In this work, we explore the tree by using a best-first search strategy.

The generation of an action plan proceeds as follows. First, the current state is stored in the root node of the decision tree. Action candidates (pass, dribble, shoot, etc.) that may involve teammates as well as the player itself are generated from the current state and the state predicted by the player. For each candidate, the player checks precisely whether the action is executable; if it is not, the candidate is discarded, so only feasible actions remain. The generated actions are then scored by an evaluation function, and the action, the resulting state, and the evaluation value are stored as a child node of the decision tree. Once all child nodes have been added, the node with the highest value is selected, and further action candidates are generated from its predicted state. The decision tree is expanded by repeating this procedure. Child nodes are not generated at a leaf when the depth of the tree reaches a fixed threshold, when no action can be generated from the predicted state of the node, or when an action defined as a terminal condition of the action sequence is generated. Each path from the root of the generated decision tree forms an action sequence. Once the search completes, the sequence leading to the node with the maximal value is taken as the action plan. Thanks to this planning, players can select more strategic actions by looking ahead in this proactive manner.
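As an illustration, the following minimal Python sketch shows a best-first expansion of such an action tree. The helpers generate_actions, predict_state, and evaluate are hypothetical placeholders for the candidate generator, state predictor, and evaluation function described above; the actual planner of [1] operates on the full simulator state and uses additional termination conditions.

```python
import heapq

def plan_actions(root_state, evaluate, generate_actions, predict_state,
                 max_depth=4, max_nodes=1000):
    """Best-first expansion of the action tree: repeatedly expand the most
    promising node and return the action sequence leading to the node with
    the highest evaluation value."""
    counter = 0  # unique tie-breaker so heapq never has to compare states
    frontier = [(0.0, counter, 0, root_state, [])]  # (-value, id, depth, state, sequence)
    best_value, best_sequence = float("-inf"), []
    expanded = 0

    while frontier and expanded < max_nodes:
        neg_value, _, depth, state, sequence = heapq.heappop(frontier)
        expanded += 1
        if sequence and -neg_value > best_value:
            best_value, best_sequence = -neg_value, sequence
        if depth >= max_depth:        # terminal condition: depth threshold reached
            continue
        for action in generate_actions(state):      # only executable candidates
            next_state = predict_state(state, action)
            value = evaluate(next_state)            # evaluation value of the child node
            counter += 1
            heapq.heappush(frontier, (-value, counter, depth + 1,
                                      next_state, sequence + [action]))
    return best_sequence
```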

Figure 1 depicts an example decision tree. For the sake of simplicity, only the evaluation value of each action is indicated in each node, and actions are written on the edges. In this example, the actions generated from the current state are one dribble and two passes. Each action is evaluated by the evaluation function. Since the dribble is the first action that maximizes the evaluation function, executable actions are generated from this node. In Fig. 1, the resulting action plan would be the sequence of a pass followed by a shoot.

In this paper, we focus on the efficient development of evaluation functions, which is an important factor in designing decision making. In order to build strong teams, it is necessary for each player to select the best actions. Evaluation functions, however, are generally designed by hand, so they are not necessarily optimal. In addition, designing such a function requires a trial-and-error iterative process. Therefore, in this work, we investigate the use of supervised learning to design such functions automatically.

Fig. 1. Example of an action plan

4 Learning Evaluation Functions by Neural Networks

Positive and negative episodes of kick sequences have to be defined in order to train neural networks with a supervised learning approach. In this paper, we consider as positive episodes the sequences of kicks that end inside the opponent's penalty area. Sequences that end outside this area (e.g., because of an opponent's interception) are considered negative episodes. The target value for negative episodes is defined as 0, while that for positive ones is 1. Figure 2 depicts examples of such episodes, where red lines represent positive episodes and dotted blue lines correspond to negative ones. As Fig. 2 shows, an episode consists of a series of ball coordinates.
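As a minimal sketch of this labeling rule, the snippet below assigns the target value from the final ball position of an episode. The penalty-area bounds are assumptions based on the standard 2D-simulator field (105 m x 68 m, penalty area 16.5 m deep and 40.32 m wide); the helper name and episode representation are illustrative only.

```python
# An episode is assumed to be a list of (x, y) ball positions at successive kicks.
PENALTY_X_MIN, PENALTY_X_MAX = 36.0, 52.5   # assumed opponent penalty-area depth
PENALTY_Y_HALF_WIDTH = 20.16                # assumed half of the penalty-area width

def label_episode(episode):
    """Return 1 if the kick sequence ends inside the opponent's penalty area
    (positive episode), and 0 otherwise (negative episode)."""
    x_end, y_end = episode[-1]
    inside = (PENALTY_X_MIN <= x_end <= PENALTY_X_MAX
              and abs(y_end) <= PENALTY_Y_HALF_WIDTH)
    return 1 if inside else 0
```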

Neural networks are employed to model evaluation functions. The main reason we employ neural networks is that they are universal function approximators. Additionally, the architecture of a neural network can be easily changed, which allows us to investigate various settings. In this paper, two versions of input features are used. The first is the ball position at the next kick \((x_n, y_n)\), i.e., a two-dimensional input feature vector. The second is the ball position at the current kick \((x_c, y_c)\) together with the ball position at the next kick \((x_n, y_n)\), i.e., a four-dimensional input feature vector.

Fig. 2. Extracted positive episodes (red lines) and negative episodes (dotted blue lines) (Color figure online)

The episodes extracted from log files are converted into training data for the neural networks. As there are two versions of input features, each extracted episode is processed in two ways. For the two-dimensional case, the ball positions in an episode are separated into individual ball positions, and each of them is used as a training vector consisting of the ball position \((x_n, y_n)\) together with a positive/negative target value. This process is shown in the upper part of Fig. 3. In the four-dimensional case, each pair of successive ball positions is used to generate a training vector: the first position of the pair is regarded as the current ball position and the second as the predicted ball position at the next kick. The two positions are concatenated into a four-dimensional input vector \((x_c, y_c, x_n, y_n)\). The target value of the generated vector is the label (i.e., positive or negative) of the episode the pair was extracted from. The lower part of Fig. 3 shows this process.
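The conversion in Fig. 3 can be sketched as follows, again assuming an episode is represented as a list of \((x, y)\) ball positions at successive kicks; the function name and sample format are hypothetical.

```python
def episode_to_samples(episode, label, use_current_position=False):
    """Convert one labelled episode into training vectors.

    use_current_position=False -> 2-D inputs (x_n, y_n), one per kick.
    use_current_position=True  -> 4-D inputs (x_c, y_c, x_n, y_n), one per
                                  pair of successive kicks.
    The same positive/negative label of the episode is attached to every vector.
    """
    samples = []
    if not use_current_position:
        for (x_n, y_n) in episode:
            samples.append(([x_n, y_n], label))
    else:
        for (x_c, y_c), (x_n, y_n) in zip(episode, episode[1:]):
            samples.append(([x_c, y_c, x_n, y_n], label))
    return samples
```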

Fig. 3. Conversion of an episode into training data

5 Experiments

5.1 Experimental Settings

We evaluate the performance of evaluation functions modeled by neural networks trained with supervised learning. Performance is evaluated by counting the number of times the ball enters a target area, the number of scored goals, and the number of successful through passes. Training data such as passes and dribbles were extracted from game logs between an expert team and an opponent team. HELIOS2017 [15], which won the RoboCup 2017 tournament in the soccer simulation 2D league, was employed as the expert. HillStone [16], which finished in eighth position, was employed as the opponent team. In this experiment, we tried to defeat the target team by making our own team, opuSCOM, mimic the expert team (i.e., HELIOS2017). opuSCOM is developed by Osaka Prefecture University for the JapanOpen competitions, the Japanese national RoboCup contest. We designated HillStone as our target team because, while it is not a top-ranked team, it is much stronger than opuSCOM. We chose HELIOS2017 as the expert team since it has been one of the top-ranked teams for several years. In addition, opuSCOM and HELIOS2017 share the same base code, Agent2D (HELIOS base) [17], which is currently one of the most popular base codes for the RoboCup soccer simulation 2D league. In particular, their formation configuration files are almost the same. Therefore, it should be easy for opuSCOM to copy HELIOS2017's formation strategies.

In this experiment, we set up three types of formation strategies. The first formation consists of four defenders, three midfielders, and three attackers, and is named the 433-formation; it is the formation mainly used by our team. The second is HELIOS2017's formation, the 4231-formation. The last is the 442-formation, which has two top attackers. The main reason for employing the last formation is that we want to investigate the effect of using a different number of attackers.

In addition, we investigate several neural network architectures, summarized in Table 1. The four-layered (2or4-100-100-1) neural networks use sigmoid activation functions. In the seven-layered (2or4-50-50-50-50-50-1) neural networks, the Leaky-ReLU function [18] is employed as the activation function in order to prevent vanishing gradients and dead neurons. Moreover, we investigate two types of activation function for the output layer. The first is the sigmoid function and the second is a linear activation (i.e., no activation function). An output layer using a sigmoid function produces values in [0, 1], so the output can be interpreted as the probability of entering the opponent's penalty area at the end of the sequence. An output layer with a linear activation produces an unbounded value; this is the setting usually employed for regression problems. The learning rate is set to 0.001 for all structures.
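For illustration, the architectures above could be implemented as in the following PyTorch sketch. The framework, the optimizer, and the function name are assumptions; only the layer sizes, the hidden and output activations, and the learning rate of 0.001 follow the settings described above.

```python
import torch.nn as nn

def build_evaluator(input_dim=2, deep=False, sigmoid_output=True):
    """Sketch of the two architectures:
    - shallow: input-100-100-1 with sigmoid hidden activations
    - deep:    input-50-50-50-50-50-1 with Leaky-ReLU hidden activations
    The output is either squashed by a sigmoid ("sig") or left linear ("reg").
    """
    if deep:
        hidden, act_fn = [50, 50, 50, 50, 50], nn.LeakyReLU
    else:
        hidden, act_fn = [100, 100], nn.Sigmoid

    layers, prev = [], input_dim
    for width in hidden:
        layers += [nn.Linear(prev, width), act_fn()]
        prev = width
    layers.append(nn.Linear(prev, 1))
    if sigmoid_output:
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

# e.g. a "relu-sig" model with 4-dimensional inputs, trained with lr = 0.001
# (the choice of Adam is an assumption):
# model = build_evaluator(input_dim=4, deep=True, sigmoid_output=True)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```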

In addition, we investigate two types of training procedures. The first uses all kick sequences, so that all players share the same evaluation function. The second uses only the sequences involving the learning player itself, so that each player learns its own evaluation function. While the latter procedure requires training several neural networks, each individual training task may be easier.
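The two procedures can be illustrated by the sketch below, which splits the generated training samples into a single shared pool ("all") and per-player pools ("each"). The episode representation, a tuple of the uniform numbers of the involved kickers and the extracted samples, is a hypothetical simplification.

```python
from collections import defaultdict

def split_training_data(labelled_episodes):
    """labelled_episodes: iterable of (kicker_numbers, samples) tuples, where
    samples are the training vectors extracted from one kick sequence."""
    all_samples = []                 # shared data for the single "all" evaluator
    per_player = defaultdict(list)   # separate data for each "each" evaluator
    for kicker_numbers, samples in labelled_episodes:
        all_samples.extend(samples)
        for number in set(kicker_numbers):
            per_player[number].extend(samples)
    return all_samples, per_player
```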

We assume that players of the expert team take the team strategy into account when selecting an action. Therefore, the expert's kick sequences are expected to contain information about these considerations. By fitting each player's action selector to the expert's, the team's behavior becomes close to that of the expert.

The learned evaluation functions are implemented in opuSCOM, and we evaluate their performance by making it play against HillStone. Performance is measured over 100 games.

Table 1. Summary of experimental settings

5.2 Results

Figure 4 shows the performance of several trained neural networks in comparison with opuSCOM's default hand-coded evaluation functions. Comparisons are based on three criteria: the number of scored goals, the number of times the ball entered the opponent's penalty area, and the number of successful through passes. The evaluation functions modeled by neural networks outperform those designed by humans regardless of the criterion.

Fig. 4. Team performance with various neural network models

Tables 2, 3 and 4 summarize opuSCOM's win rate against HillStone for each experimental setting. The automatically designed evaluation functions increase the win rate by 10 percentage points or more: the neural network evaluators win more than 50% of the games, whereas the default evaluators reach a win rate of at most 40%. On the other hand, performance decreased in some experimental settings. This is particularly the case for the sig-sig model when used with the formation that has fewer top attackers (the 442-formation).

Table 2. opuSCOM’s win rate against HillStone when using a hand-coded evaluator
Table 3. opuSCOM’s win rate against HillStone when using an “all” evaluator
Table 4. opuSCOM’s win rate against HillStone when using an “each” evaluator
Fig. 5. Evaluation function learned by the sig-sig model

Fig. 6. Evaluation function learned by the sig-reg model

Fig. 7. Evaluation function learned by the relu-sig model

Fig. 8. Evaluation function learned by the relu-reg model

Figures 5, 6, 7 and 8 depict examples of evaluation functions modeled by neural networks trained with supervised learning. Note that the \(x\)-\(y\) plane represents the soccer field: the area \(x > 0\) is the opponent's side, while the area \(x < 0\) is our side. The figures show the functions learned by the sig-sig-all-2input, sig-reg-all-2input, relu-sig-all-2input and relu-reg-all-2input models, following the abbreviated names in Table 1. To draw these visualizations, we discretized the soccer field and evaluated every position with the trained models. As shown in Figs. 5, 6, 7 and 8, the neural networks learned meaningful evaluation functions regardless of the experimental setting. Actions that could bring the ball inside the opponent's penalty area receive a high evaluation value, whereas actions with a low predicted success probability tend to receive a lower value, even if the ball is close to the goal. These observations suggest that the neural networks learn the expert's kick sequences and mimic its action selection. In Fig. 6, the value produced by the neural network exceeds 1.0 because of the linear output activation and the many contradictions in the training data: some situations (e.g., a corner kick) are labeled differently even though the field information is exactly the same.
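For the 2-input models, the visualization procedure can be sketched as follows: the field is discretized and every position is scored by the trained evaluator. The grid step and the callable model interface are illustrative assumptions; the field dimensions are the standard 2D-simulator values.

```python
import numpy as np

def evaluation_grid(model, step=1.0, half_length=52.5, half_width=34.0):
    """Score every discretized ball position (x_n, y_n) on a 105 m x 68 m field.

    model: any callable mapping [x, y] to a scalar evaluation value
           (e.g., a wrapper around a trained 2-input network).
    Returns the x coordinates, y coordinates, and the grid of values.
    """
    xs = np.arange(-half_length, half_length + step, step)
    ys = np.arange(-half_width, half_width + step, step)
    grid = np.zeros((len(ys), len(xs)))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            grid[i, j] = float(model([x, y]))
    return xs, ys, grid
```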

6 Conclusion

In this paper, we proposed a method that improves the performance of a team by making it mimic a stronger one. For this purpose, a neural network is employed to model the evaluation function of the team to be mimicked. The neural network is trained on positive and negative episodes of action sequences. The proposed method can learn the behavior of a given team and outperforms evaluation functions designed by humans. This makes it possible to improve the performance of a team easily and automatically; by designing evaluation functions automatically, we can focus on the development of efficient strategies. In future work, we will try to improve the mimicking performance by investigating various neural network structures. In addition, we will consider the use of reinforcement learning in order to outperform the expert team itself.