Abstract
We study reinforcement learning in multi-agent domains where a central controller collects all information and decides an action for every agent. Centralized multi-agent reinforcement learning (MARL), however, suffers from a combinatorial explosion of the action space. In this work, we propose an improved proximal policy optimization (PPO) algorithm whose neural network is based on the attention mechanism to address this explosion. Our model outputs a joint action rather than distributed actions. Because the attention mechanism shares parameters across agents, the size of the neural network grows linearly with the length of a single agent's local observation, regardless of the number of agents. In addition, multi-agent credit assignment is handled naturally by gradient ascent through the attention layer. Experimental results demonstrate that our method outperforms independent PPO and centralized PPO with other network architectures.
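To make the parameter-sharing claim concrete, the following is a minimal sketch and not the authors' network, whose details are not given in this abstract. It assumes a single-head self-attention layer shared by all agents and a per-agent action head; the names `init_params`, `joint_policy`, and `n_params`, and the sizes `obs_dim=8`, `d_model=16`, `n_actions=5`, are hypothetical choices for illustration. Treating the joint action as one action distribution per agent is also an assumption about how enumerating every action combination can be avoided.

```python
import numpy as np

def init_params(obs_dim, d_model, n_actions, seed=0):
    """Weights shared by every agent; no shape depends on the number of agents."""
    rng = np.random.default_rng(seed)
    def w(shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {
        "embed": w((obs_dim, d_model)),   # grows linearly with obs_dim
        "q": w((d_model, d_model)),
        "k": w((d_model, d_model)),
        "v": w((d_model, d_model)),
        "out": w((d_model, n_actions)),
    }

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_policy(params, obs):
    """Map observations (n_agents, obs_dim) to action probabilities (n_agents, n_actions)."""
    h = np.tanh(obs @ params["embed"])                 # per-agent embedding
    q, k, v = h @ params["q"], h @ params["k"], h @ params["v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (n_agents, n_agents) weights
    mixed = h + attn @ v                               # each agent attends to all agents
    return softmax(mixed @ params["out"])              # one distribution per agent

def n_params(params):
    return sum(p.size for p in params.values())

params = init_params(obs_dim=8, d_model=16, n_actions=5)
rng = np.random.default_rng(1)
probs_3 = joint_policy(params, rng.normal(size=(3, 8)))    # 3 agents
probs_10 = joint_policy(params, rng.normal(size=(10, 8)))  # 10 agents, same weights

# Size comparison for 10 agents with 5 actions each:
flat_joint_actions = 5 ** 10   # outputs needed to enumerate every joint action
factored_outputs = 10 * 5      # outputs produced by the per-agent heads above
```

The same `params` dictionary serves 3 agents and 10 agents, so the parameter count depends on `obs_dim`, `d_model`, and `n_actions` but not on the number of agents. Because the attention layer treats agents as an unordered set, permuting the rows of `obs` permutes the rows of the output in the same way.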
Cite this article
Lu, C., Bao, Q., Xia, S. et al. Centralized reinforcement learning for multi-agent cooperative environments. Evol. Intel. 17, 267–273 (2024). https://doi.org/10.1007/s12065-022-00703-4
DOI: https://doi.org/10.1007/s12065-022-00703-4