Abstract
Real-world multi-agent tasks often involve varying types and quantities of agents. The complex interaction relationships among these agents make policy learning difficult, because the agents must learn many types of interaction to complete a given task. Simplifying the learning process is therefore an important issue. In multi-agent systems, agents of a similar type often interact more with each other and exhibit more similar behaviors, which means that collaboration among these agents is stronger. Most existing multi-agent reinforcement learning (MARL) algorithms attempt to learn the collaborative strategies of all agents directly in order to maximize the common reward. As a result, the difficulty of policy learning increases exponentially with the number and types of agents. To address this problem, we propose a type-based hierarchical group communication (THGC) model. This model uses prior domain knowledge or predefined rules to group agents, and maintains each group’s cognitive consistency through knowledge sharing. We then introduce a group communication and value decomposition method to ensure cooperation between the groups. Experiments demonstrate that our model outperforms state-of-the-art MARL methods on the widely adopted StarCraft II benchmarks across different scenarios, and that it has potential value for large-scale real-world applications.
References
Bear A, Kagan A, Rand DG (2017) Co-evolution of cooperation and cognition: the impact of imperfect deliberation and context-sensitive intuition. Proc Royal Soc B Biol Sci 284(1851):20162326
Bresciani PG, Giunchiglia P, Mylopoulos F, Perini J, TROPOS A (2004) An agent oriented software development methodology. Journal of autonomous agents and multiagent systems, Kluwer Academic Publishers
Butler E (2012) The condensed wealth of nations. Centre for Independent Studies
Carion N, Usunier N, Synnaeve G, Lazaric A (2019) A structured prediction approach for generalization in cooperative multi-agent reinforcement learning. In: Advances in neural information processing systems, pp 8130–8140
Chen Y, Zhou M, Wen Y, Yang Y, Su Y, Zhang W, Zhang D, Wang J, Liu H (2018) Factorized q-learning for large-scale multi-agent systems. arXiv:1809.03738
Chuang L, Chao X, Jie H, Wenzhuo L, et al. (2017) Hierarchical architecture design of computer system. Chinese J Comput 40(09):1996–2017
Clevert DA, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by exponential linear units (elus). arXiv:1511.07289
Cossentino M, Gaglio S, Sabatucci L, Seidita V (2005) The passi and agile passi mas meta-models compared with a unifying proposal. In: International central and eastern european conference on multi-agent systems, pp 183–192. Springer
Cossentino M, Hilaire V, Molesini A, Seidita V (2014) Handbook on agent-oriented design processes. Springer, Berlin
Das A, Gervet T, Romoff J, Batra D, Parikh D, Rabbat M, Pineau J (2018) Tarmac: Targeted multi-agent communication. arXiv:1810.11187
Dugas C, Bengio Y, Bélisle F, Nadeau C, Garcia R (2009) Incorporating functional knowledge in neural networks. J Mach Learn Res 10(Jun):1239–1262
Foerster JN, Farquhar G, Afouras T, Nardelli N, Whiteson S (2018) Counterfactual multi-agent policy gradients. In: Thirty-second AAAI conference on artificial intelligence
Gordon DM (1996) The organization of work in social insect colonies. Nature 380(6570):121–124
Ha D, Dai A, Le QV (2016) Hypernetworks. arXiv:1609.09106
Henriques R, Madeira SC (2016) Bicnet: Flexible module discovery in large-scale biological networks using biclustering. Algorithms Mol Biol 11(1):14
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Computat 9(8):1735–1780
Iqbal S, Sha F (2018) Actor-attention-critic for multi-agent reinforcement learning. arXiv:1810.02912
Jeanson R, Kukuk PF, Fewell JH (2005) Emergence of division of labour in halictine bees: contributions of social interactions and behavioural variance. Anim Behav 70(5):1183–1193
Jiang J, Dun C, Lu Z (2018) Graph convolutional reinforcement learning for multi-agent cooperation. arXiv:1810.09202
Jiang J, Lu Z (2018) Learning attentional communication for multi-agent cooperation. In: Advances in neural information processing systems, pp 7254–7264
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980
Liu Y, Hu Y, Gao Y, Chen Y, Fan C (2019) Value function transfer for deep multi-agent reinforcement learning based on n-step returns. In: Proceedings of the twenty-eighth international joint conference on artificial intelligence, pp 457–463
Liu Y, Wang W, Hu Y, Hao J, Chen X, Gao Y (2019) Multi-agent game abstraction via graph attention neural network. arXiv:1911.10715
Long Q, Zhou Z, Gupta A, Fang F, Wu Y, Wang X (2020) Evolutionary population curriculum for scaling multi-agent reinforcement learning. arXiv:2003.10423
Lowe R, Wu YI, Tamar A, Harb J, Abbeel OP, Mordatch I (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in neural information processing systems, pp 6379–6390
Mao H, Liu W, Hao J, Luo J, Li D, Zhang Z, Wang J, Xiao Z (2019) Neighborhood cognition consistent multi-agent reinforcement learning. arXiv:1912.01160
Melo FS, Veloso M (2011) Decentralized mdps with sparse interactions. Artif Intell 175(11):1757–1789
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G et al (2015) Human-level control through deep reinforcement learning. Nature 518(7540):529–533
Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: ICML
Oliehoek FA, Amato C, et al. (2016) A concise introduction to decentralized POMDPs, vol 1. Springer, Berlin
OroojlooyJadid A, Hajinezhad D (2019) A review of cooperative multi-agent deep reinforcement learning. arXiv:1908.03963
Pal SK, Mitra S (1992) Multilayer perceptron, fuzzy sets classification
Ryu H, Shin H, Park J (2020) Multi-agent actor-critic with hierarchical graph attention network. In: AAAI, pp 7236–7243
Samvelyan M, Rashid T, de Witt CS, Farquhar G, Nardelli N, Rudner TG, Hung CM, Torr PH, Foerster J, Whiteson S (2019) The starcraft multi-agent challenge. arXiv:1902.04043
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O (2017) Proximal policy optimization algorithms. arXiv:1707.06347
Singh A, Jain T, Sukhbaatar S (2018) Learning when to communicate at scale in multiagent cooperative and competitive tasks. arXiv:1812.09755
Son K, Kim D, Kang WJ, Hostallero DE, Yi Y (2019) Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv:1905.05408
Stone P, Veloso M (2000) Multiagent systems: a survey from a machine learning perspective. Auton Robot 8(3):345–383
Sukhbaatar S, Fergus R, et al. (2016) Learning multiagent communication with backpropagation. In: Advances in neural information processing systems, pp 2244–2252
Sunehag P, Lever G, Gruslys A, Czarnecki WM, Zambaldi V, Jaderberg M, Lanctot M, Sonnerat N, Leibo JZ, Tuyls K et al (2017) Value-decomposition networks for cooperative multi-agent learning. arXiv:1706.05296
Sutton RS, McAllester DA, Singh SP, Mansour Y (2000) Policy gradient methods for reinforcement learning with function approximation. In: Advances in neural information processing systems, pp 1057–1063
Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y (2017) Graph attention networks. arXiv:1710.10903
Wang W, Yang T, Liu Y, Hao J, Hao X, Hu Y, Chen Y, Fan C, Gao Y (2020) From few to more: large-scale dynamic multiagent curriculum learning. In: AAAI, pp 7293–7300
Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560
Whiteson S (2018) Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning
Wooldridge M, Jennings NR, Kinny D (2000) The gaia methodology for agent-oriented analysis and design. Auton Agents Multi-Agent Syst 3(3):285–312
Yang Y, Luo R, Li M, Zhou M, Zhang W, Wang J (2018) Mean field multi-agent reinforcement learning. arXiv:1802.05438
Yu C, Zhang M, Ren F, Tan G (2015) Multiagent learning of coordination in loosely coupled multiagent systems. IEEE Trans Cybern 45(12):2853–2867
Zhang Z, Yang J, Zha H (2019) Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization. arXiv:1909.10651
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China under Grant No. 2017YFB1001901, in part by the Key Program of Tianjin Science and Technology Development Plan under Grant No. 18ZXZNGX00120, and in part by the China Postdoctoral Science Foundation under Grant No. 2018M643900.
Appendices
Appendix A: Environment details
We follow the settings of SMAC [34]; full details can be found in the SMAC paper. For clarity and completeness, we restate the environment details here.
A.1 States and observations
At each time step, agents receive local observations within their field of view. This encompasses information about the map within a circular area around each unit with a radius equal to the sight range, which is set to 9. The sight range makes the environment partially observable for agents. An agent can observe another unit only if both are alive and the other unit is located within its sight range. If a unit (ally or enemy) is dead or outside an agent’s sight range, its feature vector in that agent’s observation is reset to all zeros. Hence, agents have no way to distinguish whether their teammates are far away or dead. The feature vector observed by each agent contains the following attributes for both allied and enemy units within the sight range: distance, relative x, relative y, health, shield, and unit type. If the agents are homogeneous, the unit type feature is omitted. All Protoss units have shields, which serve as a source of protection that offsets damage and can regenerate if no new damage is received. Lastly, agents can observe the terrain features surrounding them, in particular the values of eight points at a fixed radius indicating height and walkability.
The global state is composed of the joint unit features of both allied and enemy units. Specifically, the state vector includes the coordinates of all agents relative to the center of the map, together with the unit features present in the observations. Additionally, the state stores the energy or cooldown of the allied units, depending on the unit property, which represents the minimum delay between attacks or healing actions. All features, both in the global state and in the individual observations of agents, are normalized by their maximum values.
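As an illustration of the observation rule in this subsection, the following is a minimal Python sketch of one per-unit entry in an agent's observation. It is not the SMAC implementation: the dictionary keys, the function name `unit_features`, the use of an integer unit-type id, and the choice to normalize distances by the sight range are all assumptions made for the example. Terrain features are left out.

```python
import math

SIGHT_RANGE = 9.0  # sight range stated in the text


def unit_features(observer, other, include_type=True):
    """Features of `other` as seen by `observer` (hypothetical dict layout).

    Returns [distance, rel_x, rel_y, health, shield, (unit_type)], or all
    zeros when `other` is dead or outside the observer's sight range.
    """
    n_features = 6 if include_type else 5
    dx = other["x"] - observer["x"]
    dy = other["y"] - observer["y"]
    dist = math.hypot(dx, dy)

    # Both units must be alive and the other unit must be within sight.
    # Treating the boundary as visible (<=) is an assumption.
    visible = observer["alive"] and other["alive"] and dist <= SIGHT_RANGE
    if not visible:
        return [0.0] * n_features

    # Assumed normalization: positions by sight range, health and shield
    # by their per-unit maxima.
    shield = (other["shield"] / other["max_shield"]
              if other["max_shield"] > 0 else 0.0)
    feats = [dist / SIGHT_RANGE, dx / SIGHT_RANGE, dy / SIGHT_RANGE,
             other["health"] / other["max_health"], shield]
    if include_type:
        # Simplification: a raw integer id rather than a learned or
        # one-hot encoding.
        feats.append(float(other["unit_type"]))
    return feats
```

Under these assumptions, a teammate that is dead and a teammate that is alive but 30 units away both produce the same all-zero vector, which is the ambiguity the text describes.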
A.2 Action space
The discrete set of actions that agents are allowed to take consists of move[direction], attack[enemy id], stop, and no-op. Dead agents can take only the no-op action, whereas live agents cannot take it. Agents can move only by a fixed amount of 2 in one of four directions: north, south, east, or west. To ensure decentralization of the task, agents are restricted to using the attack[enemy id] action only on enemies within their shooting range. This additionally prevents the units from using the built-in attack-move micro-actions on enemies that are far away. The shooting range is set to 6 for all agents. Having a sight range larger than the shooting range allows agents to make use of the move commands before starting to fire. The unit behavior of automatically responding to enemy fire without being explicitly ordered is also disabled. As healer units, Medivacs use heal[agent id] actions instead of attack[enemy id].
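The availability rules above can be sketched as a 0/1 action mask. This is a hedged illustration only: the action ordering (no-op, stop, four moves, then one attack slot per enemy), the dictionary keys, and the function name `available_actions` are assumptions, and the sketch ignores terrain walkability and the heal[agent id] actions of Medivacs.

```python
import math

SHOOT_RANGE = 6.0  # shooting range stated in the text
N_MOVE = 4         # north, south, east, west


def available_actions(agent, enemies):
    """Return a 0/1 mask over [no-op, stop, 4 moves, attack[0..n-1]]."""
    n_actions = 2 + N_MOVE + len(enemies)
    if not agent["alive"]:
        # Dead agents may take only no-op.
        return [1] + [0] * (n_actions - 1)

    # Live agents cannot take no-op; stop and the four moves are allowed
    # (walkability checks are omitted in this sketch).
    mask = [0, 1] + [1] * N_MOVE
    for enemy in enemies:
        dist = math.hypot(enemy["x"] - agent["x"], enemy["y"] - agent["y"])
        # attack[enemy id] is allowed only for live enemies in range.
        in_range = enemy["alive"] and dist <= SHOOT_RANGE
        mask.append(1 if in_range else 0)
    return mask
```

A mask of this kind is typically applied before action selection, so that both the greedy and the random branch of the exploration policy choose only among permitted actions.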
A.3 Rewards
At each time step, the agents receive a joint reward equal to the total damage dealt to the enemy units. In addition, agents receive a bonus of 10 points after killing each opponent, and 200 points after killing all opponents and thereby winning the battle. The rewards are scaled so that the maximum cumulative reward achievable in each scenario is around 20.
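One way to read the scaling statement is sketched below. The bonuses of 10 and 200 and the target of 20 come from the text; the formula for the largest raw return (total enemy hit points plus all bonuses) and the function names are assumptions made for illustration, not the environment's actual code. Shield regeneration can let total damage exceed the initial hit points, which is one reason the maximum is only approximately 20.

```python
KILL_BONUS = 10.0         # bonus per killed opponent (from the text)
WIN_BONUS = 200.0         # bonus for killing all opponents (from the text)
TARGET_MAX_RETURN = 20.0  # approximate maximum cumulative reward (from the text)


def max_raw_return(enemy_hit_points):
    """Assumed upper bound on the unscaled return for one scenario."""
    n_enemies = len(enemy_hit_points)
    return sum(enemy_hit_points) + KILL_BONUS * n_enemies + WIN_BONUS


def step_reward(damage_dealt, kills, won, enemy_hit_points):
    """Scaled joint reward for one time step."""
    raw = damage_dealt + KILL_BONUS * kills + (WIN_BONUS if won else 0.0)
    return raw * TARGET_MAX_RETURN / max_raw_return(enemy_hit_points)


# Example: three enemies with 45 hit points each, defeated in two steps.
hit_points = [45.0, 45.0, 45.0]
episode_return = (step_reward(45.0, 1, False, hit_points)
                  + step_reward(90.0, 2, True, hit_points))
```

In this example the raw rewards sum to exactly the assumed upper bound, so `episode_return` comes out at the target value of 20 up to floating-point rounding.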
Appendix B: Training configurations
Training takes about 14 to 24 hours per map (Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, 32 GB RAM, Nvidia GTX 1050 GPU), depending on the number of agents and the map features. The total number of training steps is about 2 million, and every 10 thousand steps we train and test the model. During training, a batch of 32 episodes is retrieved from the replay buffer, which contains the most recent 1000 episodes. We use an 𝜖-greedy policy for exploration: the exploration rate starts at 1, decays linearly over the first 50 thousand steps, and ends at 0.05. We keep the default environment parameters. Hyperparameters are based on the PyMARL [34] implementation of QMIX and are listed in Table 3. The same hyperparameters are used for all StarCraft II scenarios.
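The exploration schedule and replay sampling described above can be sketched in a few lines of Python. The numeric values are those given in the text; the function names, the action-mask argument, and the use of `collections.deque` for the buffer are assumptions for illustration and are not taken from PyMARL.

```python
import random
from collections import deque

EPS_START, EPS_END, ANNEAL_STEPS = 1.0, 0.05, 50_000  # from the text
BUFFER_SIZE, BATCH_SIZE = 1000, 32                     # from the text


def epsilon(t):
    """Exploration rate at environment step t (linear decay, then constant)."""
    frac = min(max(t, 0) / ANNEAL_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)


def select_action(q_values, avail_mask, t, rng=random):
    """Epsilon-greedy choice restricted to actions whose mask entry is 1."""
    allowed = [a for a, ok in enumerate(avail_mask) if ok]
    if rng.random() < epsilon(t):
        return rng.choice(allowed)                 # explore
    return max(allowed, key=lambda a: q_values[a])  # exploit


def sample_batch(buffer, rng=random):
    """Uniformly sample BATCH_SIZE entries, or None if too few are stored."""
    if len(buffer) < BATCH_SIZE:
        return None
    return rng.sample(list(buffer), BATCH_SIZE)


# A bounded deque drops the oldest entry once BUFFER_SIZE is reached,
# so it always holds the most recent entries.
replay_buffer = deque(maxlen=BUFFER_SIZE)
```

With these values the exploration rate is 0.525 halfway through the annealing period and stays at 0.05 for the remainder of the roughly 2 million training steps.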
Cite this article
Jiang, H., Shi, D., Xue, C. et al. Multi-agent deep reinforcement learning with type-based hierarchical group communication. Appl Intell 51, 5793–5808 (2021). https://doi.org/10.1007/s10489-020-02065-9