Multi-agent actor centralized-critic with communication
Introduction
Many complex reinforcement learning problems can be modeled as cooperative multi-agent systems (MAS), including robotic navigation [1], [2], traffic monitoring [3], and satellite formation [4]. Deep reinforcement learning, which uses neural networks to approximate policies, has recently achieved impressive results in highly complex single-agent environments [5]. However, single-agent algorithms typically underperform in multi-agent environments, as the joint action-space of the agents grows exponentially with the team size.
Policies must therefore be executed in an independent, decentralized manner, where an agent has access only to its own local observation of the environment, its action-observation history, and information communicated by other team members. Various research efforts [6], [7], [8] have shown that achieving coordination among agents under such conditions remains a complex challenge with open questions.
Hence, there is a great need for new reinforcement learning methods that can efficiently learn decentralized policies. In many cases, learning can take place in a simulator or a laboratory where extra state information is available and agents can communicate freely, a paradigm known as centralized learning, distributed execution [9]. However, how best to exploit this paradigm remains an open issue [10]. An additional challenge is how to transmit relevant information to other agents in the team. Communication is a general and flexible approach that allows agents to share both low- and high-level information [11], even though it may be constrained by the environment, for example by distance. Recent research has shown that agents can learn communication protocols tabula rasa [12], [13] or derive them from symbol alphabets [14], [15]. How agents should learn their communication protocols also remains an open question.
In this work, we propose Asynchronous Advantage Actor Centralized-Critic with Communication (A3C3), a deep multi-agent reinforcement learning algorithm based on actor-critic methods. A3C3 agents train an actor, i.e., the policy, by following a gradient estimated by a centralized critic, while also training an additional communication network, which follows a gradient given by the actors of their own team-mates. A3C3 is based on three core ideas.
Firstly, it is based on Asynchronous Advantage Actor-Critic (A3C) [16], a deep learning single-agent actor-critic method. A3C uses advantages to train its actors; an advantage represents how much better an action’s actual return was than the value expectation given by the critic. A larger advantage implies the action had a better outcome than expected, and that it should be taken more often. A3C can also be scaled horizontally across multiple workers, requiring no specialized hardware.
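As a concrete illustration of this advantage signal, the following is a minimal sketch, assuming a discounted-return formulation with discount factor gamma; the function name and inputs are illustrative, not the paper's implementation:

```python
# Hypothetical sketch of the advantage signal used to weight policy updates.
# `rewards` are the rewards collected along a rollout and `values` are the
# critic's value estimates for the visited states; both are assumed inputs.

def compute_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Advantage at step t = (discounted return from t) - V(s_t)."""
    advantages = []
    ret = bootstrap_value  # value estimate for the state after the rollout
    for r, v in zip(reversed(rewards), reversed(values)):
        ret = r + gamma * ret          # discounted return R_t
        advantages.append(ret - v)     # positive => action did better than expected
    advantages.reverse()
    return advantages

# Example: a one-step rollout where the reward exceeds the critic's estimate
adv = compute_advantages(rewards=[1.0], values=[0.5], bootstrap_value=0.0)
# adv[0] = 1.0 + 0.99 * 0.0 - 0.5 = 0.5 > 0, so the action is reinforced
```

A positive advantage increases the probability of the sampled action under the policy gradient, while a negative one decreases it.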
Secondly, A3C3 uses a centralized critic. The critic, only used during learning, incorporates all agents’ observations and any additional information provided by the environment. During execution, agents do not require a critic, and can act based solely on their local observations. The centralized critic speeds up and stabilizes the learning process. Its function is to estimate the expected return for a given state when agents follow their current policies. It is simpler than approaches which output value estimates for all actions in a given state, and, by extension, is trained faster.
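The key structural point, that the critic consumes the joint view of all agents while each actor keeps only its local one, can be sketched as follows. This is a deliberately minimal linear critic with illustrative names and shapes, not the paper's network:

```python
import numpy as np

# Hypothetical sketch: during learning, a centralized critic scores the joint
# state by consuming every agent's observation at once; at execution time the
# actors run without it. Names, shapes, and the linear model are assumptions.

rng = np.random.default_rng(0)
num_agents, obs_dim = 3, 4
W = rng.normal(scale=0.1, size=(num_agents * obs_dim, 1))  # linear critic weights

def centralized_value(observations):
    """observations: list with one local observation vector per agent."""
    joint = np.concatenate(observations)      # critic sees the joint view
    return float(joint @ W)                   # scalar state-value estimate

def critic_update(observations, target_return, lr=0.01):
    """One regression step toward the observed return (MSE gradient)."""
    global W
    joint = np.concatenate(observations)
    error = centralized_value(observations) - target_return
    W -= lr * error * joint[:, None]          # reduces (V - R)^2

obs = [rng.normal(size=obs_dim) for _ in range(num_agents)]
before = abs(centralized_value(obs) - 1.0)
for _ in range(50):
    critic_update(obs, target_return=1.0)
after = abs(centralized_value(obs) - 1.0)
# after < before: the joint-view value estimate moves toward the return
```

Because the critic only exists at training time, nothing in the deployed actors depends on `W` or on other agents' observations.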
Thirdly, A3C3 uses a communication network. This network, like the actor, takes as input only local information available to each agent (such as local observations or incoming messages sent by other agents). It outputs messages, modeled as vectors of continuous values, which are sent to other agents subject to environment constraints (range limitations, noise, among others). The protocol learned by each population is unique, and stimulates coordination among agents by improving the policies of an agent’s team-mates.
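The message path described above, where a continuous vector is emitted from local information and may be perturbed or dropped by the environment before delivery, can be sketched as follows. The stand-in network, noise model, and range check are all illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of the message path: each agent's communication network
# emits a real-valued vector, and the environment may perturb or drop it
# (e.g. Gaussian noise, out-of-range agents). All names are illustrative.

rng = np.random.default_rng(42)

def communication_network(local_observation, msg_dim=2):
    """Stand-in for the learned network: maps a local observation to a message."""
    return np.tanh(local_observation[:msg_dim])  # bounded continuous message

def transmit(message, noise_std=0.0, in_range=True):
    """Apply environment constraints: additive noise and range-limited delivery."""
    if not in_range:
        return np.zeros_like(message)            # dropped message
    return message + rng.normal(scale=noise_std, size=message.shape)

obs = np.array([0.3, -1.2, 0.7])
msg = communication_network(obs)
received_clean = transmit(msg)                   # delivered unchanged
received_noisy = transmit(msg, noise_std=0.1)    # perturbed copy
received_far = transmit(msg, in_range=False)     # all zeros
```

Because messages are continuous vectors, the gradient from a team-mate's actor can flow back through `transmit` into the sender's communication network during centralized learning.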
The remainder of this paper is structured as follows. Section 2 lists related work, from actor-critic based algorithms with centralized critics, to reinforcement learning algorithms that simultaneously learn communication protocols and action policies. Section 3 states our problem, and Section 4 introduces and formally describes A3C3, as well as its architecture, modules, limitations, and methodology. Section 5 shows the results of our proposal in complex environments, on two multi-agent environment suites also used by other state-of-the-art algorithms. Finally, Section 6 draws conclusions and lists future work directions.
Related work
Multi-Agent Reinforcement Learning (MARL) is the discipline that focuses on models where agents dynamically learn policies through interaction with the environment. An agent’s goal is to maximize its local reward, a numerical representation of a long-term objective [17]. In a MAS, multiple agents behave as learners, selecting and performing actions on the environment, which then reaches a new state. Agents sample observations from the environment’s current state, and obtain a reward associated
Problem statement
We focus on multi-agent cooperative environments with J agents, in which each agent j has local partial observations of the environment at each discrete time-step t. An observation is a (usually incomplete) representation of the environment’s state st, and can be noisy, discrete or continuous. Orthogonally, observations may also be local or global. Global observations represent the state without any agent specific information or perspective (e.g., a bird’s eye view over a soccer field),
Asynchronous Advantage Actor Centralized-Critic with communication
Single-agent distributed algorithms like Asynchronous Advantage Actor-Critic (A3C) [40] run multiple parallel workers on multi-core CPUs and have been shown to outperform single-threaded GPU-based algorithms. They keep global networks that are updated asynchronously by multiple workers, as shown in Fig. 3. Because they are based on Actor-Critic, they keep an actor network, outputting the probability π(at|ot; θa) of taking action at at time-step t, based on the current observation ot and the
Results
The A3C3 algorithm is tested in the POC and MPE environment suites. A3C3 is compared against state-of-the-art single-agent algorithms, A3C, DDPG, and PPO. It is then compared against multi-agent MADDPG, and the effects of its centralized critic and communication networks are independently tested. After this, the effects of noise in the communication medium are tested against baselines with no noise or no communication. Finally, multiple advantage estimation formulae are evaluated. Some of the
Conclusion
This article describes a multi-agent deep reinforcement learning algorithm, which we call Asynchronous Advantage Actor Centralized-Critic with Communication (A3C3), where distributed worker threads use Actor-Critic methods to asynchronously optimize value, policy and communication networks for agents. The algorithm features a centralized learning phase, distributed execution, and inter-agent communication. A3C3 supports partially observable domains, noisy communications, heterogeneous reward
CRediT authorship contribution statement
David Simões: Conceptualization, Methodology, Software, Writing - original draft, Validation, Investigation. Nuno Lau: Supervision, Writing - review & editing. Luís Paulo Reis: Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The first author is supported by FCT (Portuguese Foundation for Science and Technology) under grant PD/BD/113963/2015. This research was partially supported by IEETA (UID/CEC/00127/2019) and LIACC (PEst-UID/CEC/00027/2019).
References (49)
- P. Hernandez-Leal, B. Kartal, M.E. Taylor, Is multiagent deep reinforcement learning the answer or the question? a...
- et al., High-dimensional continuous control using generalized advantage estimation (2015)
- et al., Cooperative navigation in robotic swarms, Swarm Intell. (2014)
- et al., Learning robust policies for object manipulation with robot swarms, Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA) (2018)
- et al., An experimental review of reinforcement learning algorithms for adaptive traffic signal control, Autonomic Road Transport Support Systems (2016)
- et al., Using multi-agent technology for the distributed management of a cluster of remote sensing satellites, Compl. Syst.: Fundam. Appl. (2016)
- et al., Beating the world’s best at Super Smash Bros. with deep reinforcement learning (2017)
- P. Hernandez-Leal, M. Kaisers, T. Baarslag, E.M. de Cote, A survey of learning in multiagent environments: Dealing with...
- et al., Autonomous agents modelling other agents: a comprehensive survey and open problems, Artif. Intell. (2018)
- et al., Learning to communicate with deep multi-agent reinforcement learning (2016)
- Counterfactual multi-agent policy gradients
- Planning, learning and coordination in multiagent decision processes, Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge
- Learning multiagent communication with backpropagation
- Multirobot behavior synchronization through direct neural network communication, Proceedings of the International Conference on Intelligent Robotics and Applications
- Learning cooperative visual dialog agents with deep reinforcement learning
- Emergence of grounded compositional language in multi-agent populations
- Asynchronous methods for deep reinforcement learning, Proceedings of the International Conference on Machine Learning
- Multi-agent reinforcement learning: a report on challenges and approaches
- Introduction to Reinforcement Learning
- A survey on multiagent reinforcement learning towards multi-robot systems, Proceedings of the IEEE 2005 Symposium on Computational Intelligence and Games, CIG’05
- Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, Knowl. Eng. Rev.
- Rational and convergent learning in stochastic games, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence – Volume 2
- A comprehensive survey of multiagent reinforcement learning, Trans. Syst. Man Cyber C
- Cooperative reinforcement learning for independent learners
David Simões obtained a M.Sc. (2015) in Computer and Telematics Engineering from the University of Aveiro, Portugal, and is currently a Ph.D. student in a joint Ph.D. program at the Universities of Minho, Aveiro and Porto (Portugal). His thesis topic is on learning coordination in multi-agent systems. He has worked on simulated humanoid robots and achieved different ranks in Robocup competitions including 4 world championships, and has worked in robotic and simulated maze-solving competitions, winning several national Micro-Rato competitions. His main research interests include multi-agent systems, deep learning, and game theory.
Nuno Lau is Assistant Professor at Aveiro University, Portugal and Researcher at the Institute of Electronics and Informatics Engineering of Aveiro (IEETA), where he leads the Intelligent Robotics and Systems group (IRIS). He got his Electrical Engineering Degree from Oporto University in 1993, a DEA degree in Biomedical Engineering from Claude Bernard University, France, in 1994 and the Ph.D. from Aveiro University in 2003. His research interests are focused on Intelligent Robotics, Artificial Intelligence, Multi-Agent Systems and Simulation. Nuno Lau participated in more than 15 international and national research projects, having the tasks of general or local coordinator in about half of them. Nuno Lau won more than 50 scientific awards in robotic competitions, conferences (best papers) and education. He has lectured courses at Ph.D. and M.Sc. levels on Intelligent Robotics, Distributed Artificial Intelligence, Computer Architecture, Programming, etc. Nuno Lau is the author of more than 160 publications in international conferences and journals. He was President of the Portuguese Robotics Society from 2015 to 2017.
Luís Paulo Reis is an Associate Professor at the Faculty of Engineering of the University of Porto in Portugal and Director of LIACC – Artificial Intelligence and Computer Science Laboratory at the same University. He is an IEEE Senior Member, was president of the Portuguese Society for Robotics, and is vice-president of the Portuguese Association for Artificial Intelligence. During the last 25 years, he has lectured courses on Artificial Intelligence, Intelligent Robotics, Multi-Agent Systems, Simulation and Modelling, Games and Interaction, Educational/Serious Games and Computer Programming. He was the principal investigator of more than 10 research projects in those areas. He won more than 50 scientific awards, including winning more than 15 RoboCup international competitions and best papers at conferences such as ICEIS, Robotica, IEEE ICARSC and ICAART. He supervised 20 Ph.D. and 102 M.Sc. theses to completion and is supervising 8 Ph.D. theses. He organized more than 50 international scientific events and belonged to the Program Committee of more than 250 scientific events. He is the author of more than 300 publications in international conferences and journals (indexed at SCOPUS or ISI Web of Knowledge).