1 Introduction

The RoboCup Soccer 2D Simulation League provides a rich dynamic environment, facilitated by the RoboCup Soccer Simulator (RCSS), aimed at testing advances in decentralised collective behaviours of autonomous agents. The challenges include concurrent adversarial actions, computational nondeterminism, noise and latency in asynchronous perception and actuation, and limited processing time [3, 5, 7, 29, 37, 38, 42, 43, 46]. The League's progress has been supported by several important base code releases, covering both low-level skills and standardised world models of simulated agents [1, 22, 45, 47]. The release in 2010 of the base code of the HELIOS team, agent2d-3.0.0, later upgraded to agent2d-3.1.1, was particularly influential. By 2016, about 80% of the teams had adopted agent2d as their base code, including the champion team, Gliders2016 [33, 38], which also used fragments of MarliK source code [47], and by 2019 this fraction exceeded 90%.

Gliders2016 was developed using Human-based Evolutionary Computation (HBEC) [23, 33]. In 2018, we released the code base Gliders2d [40], version v1, comprising six HBEC steps. In this paper we present the second version of Gliders2d (v2), with six additional steps:

  • Gliders2d-v2.1: blocking behaviour (disrupting the opponents in possession of the ball; based on simplified MarliK source code [47]; bhv_basic_move.cpp);

  • Gliders2d-v2.2: offside trap behaviour (bhv_basic_move.cpp);

  • Gliders2d-v2.3: assignment of heterogeneous player types (sample_coach.cpp);

  • Gliders2d-v2.4: back action (allowing players to select actions which evaluate worse than the current action; action_chain_graph.cpp);

  • Gliders2d-v2.5: tackle action (higher risk in defense; bhv_basic_move.cpp);

  • Gliders2d-v2.6: wing attacking formation (strategy.cpp).

Some of these steps are relatively simple, while others involve substantial code changes. Nevertheless, the performance improvements are tangible, and we detail them in Sect. 2. Gliders2d uses librcsc 4.1.0 [1]. It differs from the competition branch (Gliders2012–Gliders2016), being a separate evolutionary branch created over 2018–2019 to experiment with Fractals2019. We will exemplify how Guided Self-Organisation (GSO) allows us to optimise the performance, using the transition from v2.2 to v2.3, which improved the assignment of heterogeneous player types.

Fractals2019 is a new team which is partially based on Gliders2d [39], while retaining some elements of our previous champion team, Gliders2016 [33, 38]. To a large extent, Fractals2019 is an experimental entry, motivated by a new set of aims. Specifically, we redefined the fitness landscape used by optimisation in terms of dynamic constraints, rather than in terms of the performance metrics alone. Our overall approach uses guided self-organisation of tactical behaviour, shown here for the transition from v2.2 to v2.3, and is extended by a thermodynamic characterisation of collective action (Sect. 3), with results described in Sect. 4.

2 Gliders2d: Version v2

Each HBEC solution can be seen as a “genotype”, encoding the entire team behaviour in a set of “design points” [10, 40], which may vary from a single parameter (e.g., blocking depth), to an ordering of heterogeneous types with respect to some criterion (e.g., a list of integers representing players’ roles), to complex multi-agent communication schemes [17, 38, 50].

Each solution is typically evaluated against a specific opponent, over thousands of games, with the fitness function being the average goal difference, while the average points and standard error provide tie-breakers [40]. In other words, a design point (possibly conditioned on the name of a specific opponent) is accepted only if it outperforms every single opponent in the pool of available opponents. This strict acceptance criterion may be adjusted by assigning different weights to the available opponents, set in proportion to their respective strengths, which are “statically” determined in advance. These weights can be based on the competition ranking of the previous year, or can be specified more precisely, using the points achieved in a multi-game round-robin tournament involving these opponents [41]. In this evolutionary/optimisation approach, the overall fitness is determined as a weighted average computed across the opponents in the pool. As a consequence, a design point may be accepted even if it underperformed against a relatively weaker opponent (if it performed strongly against other, stronger, opponents). Rarely, a design point may even be accepted if it underperformed against the strongest opponent, provided that the other performances outweighed this loss on average.
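
In symbols (our shorthand; these variables do not appear in the released code), the weighted fitness over a pool of K opponents can be written as

$$\begin{aligned} f \; = \; \sum _{k=1}^{K} w_k \, \bar{g}_k , \qquad \sum _{k=1}^{K} w_k = 1 , \end{aligned}$$

where \(\bar{g}_k\) is the average goal difference achieved against the k-th opponent, and \(w_k\) is the weight reflecting that opponent's relative strength.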

The weighted fitness function, accommodating the relative strengths of the opponents, achieves several aims. Firstly, it better accounts for the absence of transitivity in teams' relative strengths [41]. Secondly, it reduces the development and computation time required to ensure that any given design point outperforms all available opponents. Thirdly, it produces simpler and more general-purpose code, with fewer conditional branches. It is interesting to point out the similarity of the weighted fitness function to the concept of “Nash averaging” recently introduced by Balduzzi et al. [6], who distinguished between two basic scenarios: agent-vs-agent and agent-vs-task. Nash averaging was utilised in a “dynamic” setting, where at each round the strength of the evolving team (the “agent”) itself contributed to the relative strengths of the opponents. Most recently, this was carried out by DeepMind in the AlphaStar League, while evolving an agent excelling in the real-time strategy game StarCraft II. The final AlphaStar agent comprised the elements of the Nash distribution of the league, capturing the most effective mixture of strategies discovered by the evolution guided by Nash averaging [14]. In RoboCup Simulation League scenarios, this would mean that the weights of the opponents in a pool are re-evaluated in a round-robin tournament which includes the evolving solution (design point) itself, providing better-quality solutions, albeit at a prohibitive computational cost (at this stage).

A balanced pool of opponents (an “ecosystem”), in which Gliders2d was evolved from v2.1 to v2.6 using the weighted fitness function, included four benchmark teams: Gliders2013 [32]; the 2018 world champion, HELIOS2018 [27]; the 2018 third-ranked team, MT2018 [49]; and the 2018 sixth-ranked team, YuShan2018 [9]. For each sequential step, and for the baseline Gliders2d-v1.6 (the latest step in version v1), 2000 games were played against these four benchmarks.

Against Gliders2013, the goal difference achieved by Gliders2d-v2.6 improves from \(-0.2\) to zero. Against YuShan2018, the goal difference improves from \(-0.8\) to zero, achieving parity as well. Against MT2018, the goal difference improves from \(-2.1\) to \(-1.0\). Finally, against HELIOS2018, the goal difference improves from \(-4.4\) to \(-2.4\). The latter case will be detailed specifically, for the transition from v2.2 to v2.3, when the goal difference improved from \(-4.2\) to \(-3.0\), i.e., by more than a single goal, merely due to a better assignment of heterogeneous players. Table 1 summarises the performance dynamics, including the overall goal difference and the standard error of the mean, for each of the match-ups. Adoption of the weighted fitness function ensures progression at several steps: for example, from v1.6 to v2.1.

Table 1. Performance evaluation of Gliders2d against benchmarks, over \(\sim \! \! 2000\) games for each version of Gliders2d against the opponent. The match-up marked by \({\star }\) involved \(\sim \! \!16000\) games (Sect. 4).

The released code, including the six sequential steps comprising version v2, is located at:

http://www.prokopenko.net/gliders2d.html.

The differences can easily be identified by consecutively comparing the files indicated for each step across versions v2(n) and v2(n+1), for \(n \ge 0\) (with v2.0 being v1.6). For example, differentiating the file bhv_basic_move.cpp between v1.6 and v2.1 will show the changes implementing the blocking behaviour (with the acknowledged fragments of MarliK source code [47]), while checking the differences in the same file between v2.1 and v2.2 will highlight the offside trap behaviour of Gliders2d. The offside trap is produced by collective motion of the defenders, synchronised by their perception of relevant situational variables. The changes in the last three steps are self-explanatory, and also involve several variables describing situational patterns, forming design points evolved by HBEC or GSO. The step from v2.2 to v2.3 deserves a more detailed analysis, being quite illustrative of the difference between the HBEC and GSO approaches, and showing how self-organisation can be guided by thermodynamic properties of collective behaviour.

3 Fractals2019: Thermodynamically Driven Collective Behaviours

Guided Self-Organisation constrains self-organisation within a dynamical system to paths leading towards specific attractors or outcomes. It integrates two techniques: (i) self-organising exploration of the search space, and (ii) traditional design following a “blueprint” [34, 35]. GSO has been studied in several robotic scenarios, combining (i) universal objective functions, and (ii) task-dependent constraints on the system dynamics, often using generic information-theoretic or thermodynamic methods [4, 15, 18, 20, 26, 28, 30, 31, 35, 36]. An example of combinatorial optimisation carried out in a noisy potential field using information-theoretic tools is described by Kim et al. [19] in a study which complemented hill-climbing using information entropy. The class of simulated annealing algorithms is also characterised by a thermodynamic analogy [25].

3.1 Problem Formulation

In statistical-mechanical terms, each candidate solution (a design point) can be interpreted as a configuration of a suitably defined statistical system, so that the fitness of the solution can be considered as the energy of the system in that particular configuration. Hence, one may in general represent an optimisation problem as a search for an equilibrium state of a system which optimises the appropriate thermodynamic potential (e.g., minimises the free energy).

In developing Fractals2019, we combine (i) the standard (weighted) fitness function defined by the goal difference, which is inevitably measured only imprecisely, and (ii) local constraints which indirectly represent collective behaviour. The constraints are defined with respect to the elements of the design point under consideration, restricting the search-space. In the HBEC approach, these constraints are specified before the search begins, by human designers. Importantly, under our GSO approach, these constraints are not given in advance, but are induced during the first phase of combinatorial optimisation, which is driven by maximising the goal difference alone. Given the landscape of the fitness function, partially discovered during the first phase, the method identifies the regions around local maxima, and induces partial constraints that represent these local fitness sub-spaces. In the second phase of optimisation, we use the thermodynamic analogy and guide the search from the attained local optimum towards the constrained sub-space, shaped by the discovered constraints, as sketched below.
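
The two-phase control flow can be summarised in a brief C++ sketch (all type and function names here are illustrative, not taken from the released code; the two phases are elaborated in Sects. 3.2 and 3.3):

```cpp
#include <set>
#include <utility>
#include <vector>

using Assignment = std::vector<int>;     // an ordered list of player numbers
using Constraint = std::pair<int, int>;  // (i, j) encodes the preference rho(i) < rho(j)

// Phase 1: greedy hill climbing that also induces local ranking constraints (Sect. 3.2).
Assignment hillClimbInducingConstraints( Assignment x, std::set<Constraint> & constraints );

// Phase 2: simulated annealing guided towards the constrained sub-space (Sect. 3.3).
Assignment annealTowardsConstraints( Assignment x, const std::set<Constraint> & constraints );

// Overall two-phase search: the constraints induced in phase 1 shape phase 2.
Assignment dynamicConstraintAnnealing( Assignment x0 )
{
    std::set<Constraint> constraints;
    Assignment x = hillClimbInducingConstraints( std::move( x0 ), constraints );
    return annealTowardsConstraints( std::move( x ), constraints );
}
```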

For example, a design point can be specified as an assignment of heterogeneous player types to player roles, represented as an ordered list of integers; e.g., the assignment of agent2d-3.1.1 sets \(X = \{11, 2, 3, 10, 9, 6, 4, 5, 7, 8\}\), with the function \(\rho : \mathbb {N} \rightarrow \mathbb {N}\) defining the rank in this list, e.g., \(\rho (11) = 1\). This list is used by the coach agent in assigning the strongest heterogeneous player type (defined according to some criterion, for instance, the fastest player type) to the first player on this list, i.e., player 11 (centre forward). The second best type is assigned to the second player on the list, i.e., to player 2 (left central defender), and so on, so that the weakest (slowest) type is assigned to player 8 (right midfielder). The goalkeeper, player 1, is assigned a type separately. A constraint can be specified as a preference over the ranks, e.g., \(\rho (6) < \rho (10)\) would mean that player 6 should have preference in the assignment over player 10. In general, ranking constraints have the form \(\rho (i) < \rho (j)\) for players i and j.
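
A minimal C++ sketch of this representation follows (the names rho and satisfies are ours, chosen to mirror the notation above; they are not part of the released code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Assignment = std::vector<int>;     // ordered list of player numbers
using Constraint = std::pair<int, int>;  // (i, j) encodes rho(i) < rho(j)

// rho(player): the 1-based rank of a player in the assignment list,
// e.g., rho(X, 11) == 1 for the agent2d-3.1.1 list above.
int rho( const Assignment & x, int player )
{
    for ( std::size_t k = 0; k < x.size(); ++k )
        if ( x[k] == player )
            return static_cast<int>( k ) + 1;
    return -1;  // not listed (e.g., the goalkeeper, assigned separately)
}

// Check a single ranking constraint rho(i) < rho(j).
bool satisfies( const Assignment & x, const Constraint & c )
{
    return rho( x, c.first ) < rho( x, c.second );
}
```

For the agent2d-3.1.1 list, the constraint \(\rho (6) < \rho (10)\) is not satisfied, since \(\rho (6) = 6\) while \(\rho (10) = 4\).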

Finding the assignment which maximises the fitness function \(f_X\) over all possible 10! candidate solutions \(X \in \mathcal {X}\) in the search-space is the specific optimisation problem solved at the transition between v2.2 and v2.3. Even on a high-performance computing cluster with 100 two-minute games running in parallel, checking all 3,628,800 permutations over 1000 games each (to account for the score fluctuations due to the simulation noise) would require over 138 years of continuous computation. A hill-climbing optimisation method would cut this time considerably, but is likely to find only inferior local optima. We advocate instead a GSO approach which combines elements of hill-climbing optimisation and dynamic constraint satisfaction problems, with insights from simulated annealing, informed by the thermodynamic analogy.
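
The 138-year figure follows directly from the stated parameters:

$$\begin{aligned} \frac{10! \times 1000 \times 2 \ \text {min}}{100} \; = \; 7.2576 \times 10^{7} \ \text {min} \; \approx \; 50400 \ \text {days} \; \approx \; 138 \ \text {years} . \end{aligned}$$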

One may think of our assignment problem as a variant of the Traveling Salesman Problem (TSP), where each player needs to be assigned exactly one role, and the objective is to maximise the fitness (rather than minimise the cost). The “distances” between the players (the nodes along a directed path) are not known, so we cannot simply aggregate the path segments into the total path length d. Instead, the approximate overall path distance is provided by the noisy fitness function (the average goal difference resulting from the assignment), and so the values \(f_X\) and \(f_Y\) of two neighbouring solutions X and Y yield the difference \(\delta = f_X - f_Y\).

3.2 Phase 1: Hill-Climbing and Dynamically Induced Constraints

During the first phase, we use a hill-climbing algorithm [44] based on a variant of insertion sort [21]. Hill climbing is a greedy local search algorithm which always moves in the direction of increasing fitness f by comparing with the values of local neighbours, terminating when it reaches a solution with no higher-fitness neighbours. Defining a suitable neighbour function has a significant impact on the success of this algorithm. In TSP one often defines a simple k-node flip neighbour function, where a sequence of k nodes of one solution is flipped to produce a candidate. For example, a 2-flip neighbour function applied to \(X = \{11, 2, 3, 10, 9, 6, 4, 5, 7, 8\}\) at rank 4 produces a candidate \(Y = \{11, 2, 3, 9, 10, 6, 4, 5, 7, 8\}\), with the ranks of players 10 and 9 reversed, not unlike a bubble sort algorithm that would compare \(f_X\) and \(f_Y\) before deciding if the flip should be accepted. Insertion sort is a more efficient sorting algorithm that produces a sorted list by finding a location for a given element within the current list. Starting with the initial assignment, the insertion sort algorithm picks one element from the data (e.g., in the ranked order), and finds the optimal location for this element within the list by comparing the fitness values corresponding to different locations. Having inserted this element, the algorithm iterates to the next element, until all elements are properly inserted. At every location test, only a better candidate is accepted, following the hill-climbing approach.
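
For illustration, a minimal C++ sketch of the 2-flip neighbour function and the greedy acceptance test (the names are ours, and fitness() stands in for the expensive evaluation over \(\sim \! \!1000\) games):

```cpp
#include <utility>
#include <vector>

using Assignment = std::vector<int>;

// 2-flip neighbour: swap the elements at ranks r and r+1 (ranks are 1-based,
// as in the text), e.g., a flip of X at rank 4 swaps players 10 and 9.
Assignment twoFlip( Assignment x, int r )
{
    std::swap( x[r - 1], x[r] );
    return x;
}

// Stand-in for the real evaluation: the average goal difference over ~1000 games.
double fitness( const Assignment & x );

// One greedy (hill-climbing) step: keep the flip only if it improves fitness.
Assignment tryFlip( const Assignment & x, int r )
{
    Assignment y = twoFlip( x, r );
    return ( fitness( y ) > fitness( x ) ) ? y : x;
}
```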

The challenge is that the fitness function \(f_Y\) must be computed for every location choice, that is, thousands of games need to be played for the assignment Y against the benchmark. This is needed in order to reduce the effects of fluctuations—an aspect particularly relevant when assigning heterogeneous types which are stochastically generated before each game. Nevertheless, the fitness remains noisy even for a large number of games, making hill climbing problematic.

To address this challenge and reduce the overall number of tests, we make two extensions. Firstly, we use a heuristic: an assumption of “iteration” convexity, that is, we assume that the fitness landscape is convex along the insertion path of each element. Secondly, in addition to identifying the best location for an element, the algorithm checks the neighbourhood of the identified location, and induces local constraints over the ranks, whenever possible. This approach is broadly in line with dynamic constraint satisfaction problem (DCSP) solvers in which the constraints are dynamically evolved while the CSP is being solved [48]. It differs from constraint recording techniques, in which newly learned constraints represent inconsistencies in the problem formulation, rather than the fitness landscape as such.

Let us consider the iteration of inserting 11, traced in Table 2. Initially the rank of the player is 1, i.e., \(\rho (11) = 1\), and the corresponding fitness is \(f = -4.17144\). Several locations are then iteratively tested, until a local maximum \(f^* = -3.89289\) is found at rank \(\rho (11) = 4\). The assumption of convexity along an iteration allows us to stop after the first maximum is identified during the iteration, that is, after checking the rank \(\rho (11) = 5\) with its inferior fitness. At this stage, we induce ranking constraints for 11: specifically, \(\rho (10) < \rho (11)\) is induced by comparing tests 2 and 3, and \(\rho (11) < \rho (9)\) is induced by comparing tests 3 and 4. Essentially, the assignment differences between the tests are converted into constraints that match the corresponding differences in fitness (if these differences are sufficiently large). For example, the difference between tests 2 and 3 is the 2-flip at rank 3, involving 11 and 10, and since \(f_2 < f_3\), it is induced that \(\rho (10) < \rho (11)\). This preference affects the collective behaviour of the team, which is reflected in the overall fitness (the thermodynamic potential), and so the fitness is used to induce the ranking, as in the sketch below.
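
A possible implementation of this induction rule, under the assumption (stated in the caption of Table 3) that a constraint is induced only when the fitness gap exceeds the standard error, could look as follows (again, a sketch with our own names, not the released code):

```cpp
#include <cmath>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Assignment = std::vector<int>;
using Constraint = std::pair<int, int>;  // (i, j) encodes rho(i) < rho(j)

// Compare two tested assignments x and y that differ by a single 2-flip of
// adjacent ranks; if the fitness gap is sufficiently large, induce the
// ranking constraint realised by the fitter of the two assignments.
void induceConstraint( const Assignment & x, double fx,
                       const Assignment & y, double fy,
                       double stdErr, std::set<Constraint> & constraints )
{
    if ( std::fabs( fx - fy ) <= stdErr )
        return;  // the gap is within noise: no constraint is induced
    const Assignment & better = ( fx > fy ) ? x : y;
    for ( std::size_t k = 0; k + 1 < better.size(); ++k )
        if ( x[k] != y[k] )  // first rank at which the two assignments differ
        {
            // in the fitter assignment, better[k] precedes better[k+1]
            constraints.insert( { better[k], better[k + 1] } );
            return;
        }
}
```

For tests 2 and 3 of Table 2, the fitter assignment places 10 before 11, so the pair (10, 11), i.e., \(\rho (10) < \rho (11)\), is recorded.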

Inducing such local constraints may appear redundant, as the maximum identified in this example already reflects them. However, not every iteration improves on the current maximum; an iteration may still identify local constraints that partially represent the structure of the search-space (see Sect. 4 for more examples, summarised in Table 3). All these constraints are used in the second phase, guiding a more refined search.

Our use of DCSP is motivated by the noisy fitness function. The fluctuations in the fitness function appear due to a dynamic and distributed RoboCup environment where the outcomes change from game to game—and so estimating the fitness across multiple games forms a changing problem environment. In general, problems tend to have structure, and the local constraints induced during the first phase partially discover this structure.

Table 2. Hill climbing with insertion sort. A local maximum, \(\rho (11) = 4\), is marked \(*\). Two constraints are induced: \(\rho (10) < \rho (11)\) and \(\rho (11) < \rho (9)\). Each test \(i > 0\) involved \(\sim \! \!1000\) games, test 0: \(\sim \! \!2000\) games.

3.3 Phase 2: Constraint Satisfaction via Annealing

After carrying out iterations for all elements of the design point, and obtaining a local maximum X, as well as a set of local constraints, we begin the second phase guided by a thermodynamic characterisation of the fitness landscape. In a seminal work, Černý [8] argued that the analogy with thermodynamics offers a new insight into optimisation problems such as TSP: the length \(d_X\) of a given trip, defined as a sequence/permutation of the nodes (in our terms, a given assignment of the player types), can be seen as the energy \(E_X\) of the system in that particular configuration X. Crucially, “simulating the transition to the equilibrium and decreasing the temperature, one can find smaller and smaller values of the mean energy of the system (= length of the trip)” [8]. This argument provided strong motivation for simulated annealing algorithms [16, 25], where the probability of a candidate solution X generated during an exploration of the search-space, while minimising the path length, is given by the Boltzmann-Gibbs distribution, for some temperature T and normalisation constant Z:

$$\begin{aligned} P(X) = \frac{1}{Z} \, e^{-E_X / T}. \end{aligned}$$
(1)

Given a current minimum \(d_Y\), the chances of accepting a worse candidate X with \(d_X > d_Y\), are not zero, but depend on the (energy) difference \(\delta = d_X - d_Y\). More precisely, a worse candidate is accepted with probability \(e^{-\delta / T}\). When the annealing temperature becomes sufficiently low, the acceptance probability for non-optimising solutions reduces as well.

We use simulated annealing in guiding the exploration of the search-space. Once a locally optimal solution X is obtained, we generate candidate solutions Y in some proximity of the local optimum, but keeping closer to, and preferably within, the region of the search-space restricted by the discovered local constraints (“closer” in terms of a simple neighbourhood function, e.g., insertion sort again). The probability of accepting a worse candidate Y, with \(f_Y < f_X\), is given by \(e^{-\delta / T}\), for \(\delta = f_X - f_Y\), with an accepted Y replacing X as the current best candidate. Thus, this algorithm shares Černý’s insight quoted earlier, in exploring the search-space thermodynamically, i.e., along the landscape of a thermodynamic potential, guided by local constraints towards an optimum (equilibrium) state. At the beginning of the search, when the temperature T is higher, acceptable solutions can be found deeper within the constrained sub-space, but as the search cools down, candidate solutions may sit very close to the currently identified optimum. Section 4 demonstrates the second search phase for our main example; see Table 4.
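
A minimal C++ sketch of this acceptance rule (a generic Metropolis-style acceptance for a maximised fitness; the function name and the cooling choice mentioned below are our assumptions, not the released code):

```cpp
#include <cmath>
#include <random>

// Accept or reject a candidate Y given the current best X: improving moves
// are always accepted; a worse candidate (fY < fX) is accepted with
// probability exp(-delta / T), where delta = fX - fY.
bool acceptCandidate( double fX, double fY, double T, std::mt19937 & rng )
{
    if ( fY >= fX )
        return true;
    std::uniform_real_distribution<double> u( 0.0, 1.0 );
    return u( rng ) < std::exp( -( fX - fY ) / T );
}
```

As the temperature T is lowered over the run (e.g., geometrically, \(T \leftarrow \alpha T\) with \(\alpha < 1\), a common cooling schedule we assume here), only candidates close in fitness to the current optimum remain likely to be accepted, matching the behaviour described above.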

We refer to our second phase as constraint satisfaction via annealing. It differs from the “constraint annealing” technique [24] which interprets constraints as objective functions and applies standard simulated annealing to the redefined problem. The main overall difference, of course, lies in our use of dynamically induced constraints, and so we refer to the entire optimisation method introduced in this paper as Dynamic Constraint Annealing.

In general, the objective function, and thus the dynamic local constraints, may be chosen to represent different aspects of the problem structure: maximisation of spatiotemporal coordination [30, 31], maximisation of information flows [11, 12], maximisation of thermodynamic efficiency, which contrasts the gain in uncertainty reduction with the required work [13], etc.

4 Results

In this section, we develop our leading example, detailing the two optimisation phases. Let us continue with the hill-climbing phase: having inserted element 11, we iterate the insertion sort algorithm for the other elements, applying the convexity heuristic and avoiding re-testing assignments which have been checked earlier, as shown in Table 3.

Table 3. Hill climbing with insertion sort: continuing after the first four tests, which inserted 11 at rank 4 (Table 2). The current maxima are marked with \(*\). Tests 3 and 4 are shown repeatedly, to clarify the comparisons which induced constraints. When the difference between fitness values is smaller than the standard error, a possible constraint, shown within [], is not induced. Each test is carried out over \(\sim \! \!1000\) games.

The dynamically induced local constraints define a partial ordering underlying an optimal assignment, shown in Fig. 1. It is evident that the local maximum attained by test 34 does not satisfy all these local constraints, highlighting the need for a second, constraint satisfaction, phase. The algorithm continues with the best solution \(X_{34} = \{2, 3, 5, 4, 8, 10, 11, 9, 6, 7\}\), re-evaluated over 16000 games in order to ensure a higher precision, attaining \(f_{34} = -3.14496\), and then generates candidates closer to and within the constrained sub-space, as shown in Table 4, for a specified number of tests, e.g., 10 additional tests. The maximum attained at the end of this phase is given by \(X_{44} = \{5, 4, 2, 3, 7, 6, 8, 10, 11, 9\}\) with \(f_{44} = -2.95471\), reducing the goal deficit by a further 0.2. This solution fully satisfies the local constraints.

Not surprisingly, this assignment, which was optimised against the 2018 world champion, is assortative: it prefers to place the fastest players in defence (defenders 5, 4, 3, 2, with the wing defenders 5 and 4 assigned the best types), followed by the midfielders (7, 6, 8), leaving the weakest types for the forwards (10, 11, 9). To reiterate, this is carried out at the transition step from Gliders2d-v2.2 to v2.3, which is still a relatively weak team overall. In general, such an optimisation should be repeated after each tactical improvement, adjusting the collective behaviour of heterogeneous players to the latest tactics. In fact, every time the optimal assignment changes in response to the newest tactics, it signifies an important step in the development of collective behaviour. Our winning entry, Fractals2019, maintained a more balanced assignment than Gliders2d-v2.6, disassortatively mixing the roles across types to fit a more attacking style; the final optimisation was carried out in the last days before the championship. The Dynamic Constraint Annealing method described in this study takes about three days of computation on a high-performance cluster with 100 parallel games in the RCSS synchronisation mode (about 2 min per game), with the first and second phases taking roughly 12 h (\(\sim \! \!35000\) games) and 54 h (\(\sim \! \!160000\) games), respectively. Such re-optimisation therefore becomes feasible.

Optimisation of other design points can also be carried out with Dynamic Constraint Annealing, given a suitably chosen discretisation of situational variables, so that relevant local constraints can be induced during the iterative hill-climbing phase.

Fig. 1. A partial ordering underlying an optimal assignment, implied by the induced local constraints. A ranking preference \(\rho (i) < \rho (j)\) is depicted as \(i \rightarrow j\).

Table 4. Constraint satisfaction via annealing: continuing after the first phase (Table 3) for 10 additional tests. The current maxima are marked with \(*\); worse but accepted solutions are marked with a, and worse but rejected solutions with r. Each test is carried out over \(\sim \! \!16000\) games.

5 Conclusions

Team Fractals2019 is based on the recently released Gliders2d code base [40]. The second version of Gliders2d is described and traced in this study against a pool of benchmark opponents, using a fitness function weighted by the relative strengths of the benchmarks. We followed the methodology of Guided Self-Organisation in optimising for strong team performance, within a noisy fitness landscape affected by fluctuations inherent in the nondeterministic RoboCup simulation environment. This is achieved by employing a thermodynamic analogy in characterising the global objective function, as well as local constraints which are dynamically discovered during the combinatorial optimisation. We proposed a new method, Dynamic Constraint Annealing, which improves on greedy search techniques such as hill climbing by (i) dynamically inducing local constraints, and (ii) guiding the constraint satisfaction phase by simulated annealing carried out along a trajectory approaching, or within, the constrained sub-space. We illustrated this method for a specific problem, an optimal assignment of heterogeneous player types to player roles, interpreted as a variant of the Traveling Salesman Problem (TSP), demonstrating a tangible increase of fitness over a series of tests. This technique may be generally applicable to combinatorial optimisation and constraint satisfaction problems in changing distributed environments.

Fractals2019 allowed us to verify the applicability of the GSO approach to combinatorial optimisation problems within the challenging environment of RoboCup simulation. The RoboCup-2019 competition included 15 teams from 7 countries: Australia, Brazil, China, Germany, Iran, Japan, and Portugal. Fractals2019 played 22 games during several rounds, winning 16 times, with the total score of 59:10, or 2.68:0.45 on average. The final game against team HELIOS2019 (Japan) went into extra time, and ended with Fractals2019 winning 1:0. After the tournament, we carried out a 16000-game experiment between the two finalists, using their released binaries [2, 39], with the champion team, Fractals2019, outperforming the runner-up by 0.22 goals (the average score 0.79:0.57), with a standard error of 0.009.