
Neurocomputing

Volume 390, 21 May 2020, Pages 40-56

Multi-agent actor centralized-critic with communication

https://doi.org/10.1016/j.neucom.2020.01.079

Abstract

Multiple real-world problems are naturally modeled as cooperative multi-agent systems, ranging from satellite formation to traffic monitoring. These systems require algorithms that can learn successful policies with independent agents that rely solely on local partial observations of the environment. However, multi-agent environments are more complex: from an agent’s perspective they are partially observable and non-stationary, and they also suffer from the structural credit assignment problem and the curse of dimensionality, so achieving coordination in such systems remains a complex challenge. To this end, we propose a multi-agent actor-critic algorithm called Asynchronous Advantage Actor Centralized-Critic with Communication (A3C3). A3C3 uses a centralized critic to estimate a value function, decentralized actors to approximate each agent’s policy function, and decentralized communication networks for each agent to share relevant information with its team. The critic can incorporate additional information, such as the environment’s global state, when available, and is used to optimize the actor networks. The actor networks of an agent’s teammates optimize that agent’s communication network, such that each agent learns to output information that is relevant to the policies of others. A3C3 supports a dynamic number of agents and noisy communication media, and can be horizontally scaled to shorten its learning phase. We evaluate A3C3 in two partially observable multi-agent suites where agents benefit from communicating local information to each other. A3C3 outperforms state-of-the-art multi-agent algorithms, independent approaches, and centralized controllers with access to all agents’ observations.

Introduction

Many complex reinforcement learning problems can be modeled as cooperative multi-agent systems (MAS), including robotic navigation [1], [2], traffic monitoring [3], and satellite formation [4]. Deep reinforcement learning has recently achieved great results in highly complex single-agent environments [5], through the use of neural networks to approximate policies. However, single-agent algorithms typically underperform in multi-agent environments, as the joint action space of the agents grows exponentially with the team size.
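To make this scaling concrete, if each of J agents selects among |A| actions, the joint action space already becomes intractable for modest teams (a worked example, assuming identical action sets for all agents):

```latex
|\mathcal{A}_{\mathrm{joint}}| = |\mathcal{A}|^{J},
\qquad \text{e.g.}\quad |\mathcal{A}| = 5,\; J = 10
\;\Rightarrow\; 5^{10} \approx 9.8 \times 10^{6} \text{ joint actions.}
```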

Policies must therefore be executed in an independent and decentralized manner, where each agent only has access to its own local observation of the environment, its action-observation history, and the information communicated by other team members. Various research efforts [6], [7], [8] have shown that achieving coordination among agents under such conditions remains a complex challenge with open questions.

Hence, there is a great need for new reinforcement learning methods that can efficiently learn decentralized policies. In many cases, learning can take place in a simulator or a laboratory in which extra state information is available and agents can communicate freely, a paradigm known as centralized learning, distributed execution [9]. However, how best to exploit this paradigm remains an open issue [10]. An additional challenge is how to transmit relevant information to other agents of the team. Communication is a general and flexible approach that allows agents to share both low- and high-level information [11], although it may be constrained by the environment, for example by distance. Recent research has shown that agents can learn communication protocols tabula rasa [12], [13] or derive them from symbol alphabets [14], [15]. How agents should learn their communication protocols also remains an open question.

In this work, we propose Asynchronous Advantage Actor Centralized-Critic with Communication (A3C3), a deep multi-agent reinforcement learning algorithm based on actor-critic methods. A3C3 agents train an actor, i.e., the policy, by following a gradient estimated by a centralized critic, while also training an additional communication network, which follows a gradient given by the actors of their teammates. A3C3 is based on three core ideas.

Firstly, it is based on Asynchronous Advantage Actor-Critic (A3C) [16], a single-agent deep actor-critic method. A3C uses advantages to train its actors, representing how much better an action’s actual return was than the value expectation given by the critic. A larger advantage implies the action had a better outcome than expected, and it should be taken more often. A3C can also be horizontally scaled through multiple workers, requiring no specialized hardware.
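As a concrete illustration of the advantage signal (a minimal Python sketch with our own naming, not the authors’ implementation), the advantage of each rollout step is the discounted return, bootstrapped with the critic’s estimate of the final state, minus the critic’s estimate for that step:

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Hypothetical helper: discounted returns and advantages for one worker rollout.

    rewards         -- rewards r_t collected during the rollout
    values          -- critic estimates V(o_t) for the same steps
    bootstrap_value -- critic estimate for the state after the rollout (0 if terminal)
    """
    returns = np.zeros(len(rewards))
    running = bootstrap_value
    for t in reversed(range(len(rewards))):   # accumulate discounted returns backwards
        running = rewards[t] + gamma * running
        returns[t] = running
    # Positive advantage: the observed return beat the critic's expectation.
    advantages = returns - np.asarray(values, dtype=float)
    return returns, advantages

# Three-step example rollout.
rets, advs = n_step_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.6], bootstrap_value=0.3)
```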

Secondly, A3C3 uses a centralized critic. The critic, used only during learning, incorporates all agents’ observations and any additional information provided by the environment. During execution, agents do not require a critic, and can act based solely on their local observations. The centralized critic speeds up and stabilizes the learning process. Its function is to estimate the expected return of a given state when agents follow their current policies. It is simpler than approaches that output value estimates for every action in a given state, and, by extension, is trained faster.
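A minimal sketch of such a critic (in PyTorch; feeding it the concatenation of all agents’ observations, the hidden size, and the layer choices are our assumptions, not the paper’s exact architecture) makes the difference explicit: it outputs a single scalar value per joint input rather than one value per action:

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Sketch of a centralized value network used only during learning."""

    def __init__(self, joint_obs_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one scalar value, not one estimate per action
        )

    def forward(self, joint_obs):
        # joint_obs: all agents' observations (and any extra state) concatenated.
        return self.net(joint_obs)

# Example with 3 agents, each with a 10-dimensional observation.
critic = CentralizedCritic(joint_obs_dim=3 * 10)
value = critic(torch.randn(1, 30))
```

At execution time this network is simply discarded; agents act from their local observations alone.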

Thirdly, A3C3 uses a communication network. This network, like the actor, takes as input only local information available to each agent (such as local observations or incoming messages sent by other agents). It outputs messages, modeled as vectors of continuous values, which are sent to other agents subject to the environment’s constraints (such as range limitations or noise). The protocol learned by each population is unique, and it fosters coordination by improving the policies of an agent’s teammates.
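A minimal sketch of such a network (in PyTorch; the input and output dimensions, hidden size, and Gaussian noise model are our assumptions, not the paper’s architecture):

```python
import torch
import torch.nn as nn

class CommunicationNetwork(nn.Module):
    """Sketch of a per-agent communication network: purely local inputs in,
    a continuous-valued message vector for teammates out."""

    def __init__(self, obs_dim, incoming_msg_dim, msg_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + incoming_msg_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, msg_dim),
        )

    def forward(self, local_obs, incoming_msgs, noise_std=0.0):
        msg = self.net(torch.cat([local_obs, incoming_msgs], dim=-1))
        if noise_std > 0.0:
            # A noisy medium can be modeled by perturbing the message in transit.
            msg = msg + noise_std * torch.randn_like(msg)
        return msg
```

Because these messages feed into the teammates’ actor networks, backpropagating a teammate’s policy loss through a received message yields gradients for the sender’s communication parameters, which matches the training signal described above.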

The remainder of this paper is structured as follows. Section 2 reviews related work, from actor-critic algorithms with centralized critics to reinforcement learning algorithms that simultaneously learn communication protocols and action policies. Section 3 states our problem, and Section 4 introduces and formally describes A3C3, as well as its architecture, modules, limitations, and methodology. Section 5 presents the results of our proposal on two complex multi-agent environment suites, also used by other state-of-the-art algorithms. Finally, Section 6 draws conclusions and lists directions for future work.

Section snippets

Related work

Multi-Agent Reinforcement Learning (MARL) is the discipline that focuses on models where agents dynamically learn policies through interaction with the environment. An agent’s goal is to maximize its local reward, a numerical representation of a long-term objective [17]. In a MAS, multiple agents behave as learners, selecting and performing actions on the environment, which then reaches a new state. Agents sample observations from the environment’s current state, and obtain a reward associated

Problem statement

We focus on multi-agent cooperative environments with J agents, in which each agent j has local partial observations o_t^j of the environment at each discrete time-step t. An observation o_t^j is a (usually incomplete) representation of the environment’s state s_t, and can be noisy, discrete or continuous. Orthogonally, observations may also be local or global. Global observations represent the state without any agent-specific information or perspective (e.g., a bird’s eye view over a soccer field),
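To fix notation for the discussion that follows (only J, o_t^j, and s_t come from the text above; the remaining symbols are our own shorthand for a standard cooperative, partially observable formulation):

```latex
\begin{align*}
  & s_t \in \mathcal{S} \ \text{(environment state)}, \qquad
    a_t = (a_t^1, \dots, a_t^J) \ \text{(joint action)}, \\
  & o_t^j = O^j(s_t) \ \text{(agent $j$'s local, possibly noisy and incomplete observation)}, \\
  & s_{t+1} \sim P(\,\cdot \mid s_t, a_t), \qquad
    r_t^j = R^j(s_t, a_t) \ \text{(per-agent rewards, shared in fully cooperative tasks)}.
\end{align*}
```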

Asynchronous Advantage Actor Centralized-Critic with communication

Single-agent distributed algorithms like Asynchronous Advantage Actor-Critic (A3C) [40] run multiple parallel workers on multi-core CPUs and have been shown to outperform single-threaded GPU-based algorithms. They keep global networks which are updated by multiple workers asynchronously, as shown in Fig. 3. Because they are based on Actor-Critic, they keep an actor network, outputting the probability π(a_t | o_t; θ_a) of taking action a_t at time-step t, based on the current observation o_t and the
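A rough, self-contained sketch of this asynchronous worker scheme (in PyTorch; toy linear networks and a dummy loss stand in for the real actor-critic objective, and none of the names below come from the paper’s code):

```python
import copy
import threading
import torch
import torch.nn as nn

# Toy global networks shared by all workers (stand-ins for the actor and critic).
global_actor = nn.Linear(4, 2)
global_critic = nn.Linear(4, 1)
optimizer = torch.optim.Adam(
    list(global_actor.parameters()) + list(global_critic.parameters()), lr=1e-3)

def worker(n_updates=10):
    """One asynchronous worker: sync local copies with the global networks,
    compute gradients on its own rollouts, and apply them lock-free to the
    shared global parameters."""
    local_actor = copy.deepcopy(global_actor)
    local_critic = copy.deepcopy(global_critic)
    for _ in range(n_updates):
        # 1. Pull the latest global parameters.
        local_actor.load_state_dict(global_actor.state_dict())
        local_critic.load_state_dict(global_critic.state_dict())
        # 2. Dummy rollout and loss standing in for the actor-critic objective.
        obs = torch.randn(8, 4)
        loss = local_actor(obs).pow(2).mean() + local_critic(obs).pow(2).mean()
        loss.backward()
        # 3. Copy local gradients onto the global networks and update them.
        local_params = list(local_actor.parameters()) + list(local_critic.parameters())
        global_params = list(global_actor.parameters()) + list(global_critic.parameters())
        for lp, gp in zip(local_params, global_params):
            gp.grad = lp.grad.clone()
        optimizer.step()
        optimizer.zero_grad()
        local_actor.zero_grad()
        local_critic.zero_grad()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A3C3 applies this same asynchronous scheme to its value, policy, and communication networks.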

Results

The A3C3 algorithm is tested on the POC and MPE environment suites. A3C3 is first compared against the state-of-the-art single-agent algorithms A3C, DDPG, and PPO. It is then compared against the multi-agent algorithm MADDPG, and the effects of its centralized critic and communication networks are tested independently. After this, the effects of noise in the communication medium are tested against baselines with no noise or no communication. Finally, multiple advantage estimation formulae are evaluated. Some of the

Conclusion

This article describes a multi-agent deep reinforcement learning algorithm, which we call Asynchronous Advantage Actor Centralized-Critic with Communication (A3C3), where distributed worker threads use Actor-Critic methods to asynchronously optimize value, policy and communication networks for agents. The algorithm features a centralized learning phase, distributed execution, and inter-agent communication. A3C3 supports partially observable domains, noisy communications, heterogeneous reward

CRediT authorship contribution statement

David Simões: Conceptualization, Methodology, Software, Writing - original draft, Validation, Investigation. Nuno Lau: Supervision, Writing - review & editing. Luís Paulo Reis: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The first author is supported by FCT (Portuguese Foundation for Science and Technology) under grant PD/BD/113963/2015. This research was partially supported by IEETA (UID/CEC/00127/2019) and LIACC (PEst-UID/CEC/00027/2019).


References (49)

  • P. Hernandez-Leal, B. Kartal, M.E. Taylor, Is multiagent deep reinforcement learning the answer or the question? a...
  • J. Schulman et al., High-dimensional continuous control using generalized advantage estimation (2015).
  • F. Ducatelle et al., Cooperative navigation in robotic swarms, Swarm Intell. (2014).
  • G.H. Gebhardt et al., Learning robust policies for object manipulation with robot swarms, Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA) (2018).
  • P. Mannion et al., An experimental review of reinforcement learning algorithms for adaptive traffic signal control, Autonomic Road Transport Support Systems (2016).
  • P. Skobelev et al., Using multi-agent technology for the distributed management of a cluster of remote sensing satellites, Compl. Syst.: Fundam. Appl. (2016).
  • V. Firoiu et al., Beating the world’s best at Super Smash Bros. with deep reinforcement learning (2017).
  • P. Hernandez-Leal, M. Kaisers, T. Baarslag, E.M. de Cote, A survey of learning in multiagent environments: Dealing with...
  • S.V. Albrecht et al., Autonomous agents modelling other agents: a comprehensive survey and open problems, Artif. Intell. (2018).
  • J.N. Foerster et al., Learning to communicate with deep multi-agent reinforcement learning (2016).
  • J.N. Foerster et al., Counterfactual multi-agent policy gradients (2017).
  • C. Boutilier, Planning, learning and coordination in multiagent decision processes, Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge (1996).
  • S. Sukhbaatar et al., Learning multiagent communication with backpropagation (2016).
  • D.B. D’Ambrosio et al., Multirobot behavior synchronization through direct neural network communication, Proceedings of the International Conference on Intelligent Robotics and Applications (2012).
  • A. Das et al., Learning cooperative visual dialog agents with deep reinforcement learning (2017).
  • I. Mordatch et al., Emergence of grounded compositional language in multi-agent populations (2017).
  • V. Mnih et al., Asynchronous methods for deep reinforcement learning, Proceedings of the International Conference on Machine Learning (2016).
  • S. Kapoor, Multi-agent reinforcement learning: a report on challenges and approaches (2018).
  • R.S. Sutton et al., Introduction to Reinforcement Learning (1998).
  • E. Yang et al., A survey on multiagent reinforcement learning towards multi-robot systems, Proceedings of the IEEE 2005 Symposium on Computational Intelligence and Games, CIG’05 (2005).
  • L. Matignon et al., Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, Knowl. Eng. Rev. (2012).
  • M. Bowling et al., Rational and convergent learning in stochastic games, Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence – Volume 2 (2001).
  • L. Busoniu et al., A comprehensive survey of multiagent reinforcement learning, Trans. Syst. Man Cybern. C (2008).
  • B.H.K. Abed-Alguni, Cooperative reinforcement learning for independent learners (2014).
Cited by (34)

  • Hybrid attention-oriented experience replay for deep reinforcement learning and its application to a multi-robot cooperative hunting problem, Neurocomputing (2023). Citation excerpt: "Even in the absence of high-dimensional data, the network can learn flexibly. In addition, the asynchronous advantage actor centralized-critic with communication method [8] uses a decentralized communication network to share relevant information between agents. Besides, hierarchical reinforcement learning is a common method for simplifying the multi-agent learning process and reducing the space complexity."

  • Common belief multi-agent reinforcement learning based on variational recurrent models, Neurocomputing (2022). Citation excerpt: "However, all these methods only use centralised critic to coordinate during training, and lack a coordination mechanism among agents during execution. Therefore, a large number of studies resorted to communication mechanisms [15–20] to enable coordination among agents during the execution process. These works are normally built upon the assumption that agents can share some kind of private information using explicit communication protocols or emergent symbols."


David Simões obtained an M.Sc. (2015) in Computer and Telematics Engineering from the University of Aveiro, Portugal, and is currently a Ph.D. student in a joint Ph.D. program at the Universities of Minho, Aveiro and Porto (Portugal). His thesis topic is on learning coordination in multi-agent systems. He has worked on simulated humanoid robots and achieved different ranks in RoboCup competitions, including 4 world championships, and has worked in robotic and simulated maze-solving competitions, winning several national Micro-Rato competitions. His main research interests include multi-agent systems, deep learning, and game theory.

Nuno Lau is Assistant Professor at Aveiro University, Portugal, and Researcher at the Institute of Electronics and Informatics Engineering of Aveiro (IEETA), where he leads the Intelligent Robotics and Systems group (IRIS). He got his Electrical Engineering Degree from Oporto University in 1993, a DEA degree in Biomedical Engineering from Claude Bernard University, France, in 1994, and the Ph.D. from Aveiro University in 2003. His research interests are focused on Intelligent Robotics, Artificial Intelligence, Multi-Agent Systems and Simulation. Nuno Lau participated in more than 15 international and national research projects, having the tasks of general or local coordinator in about half of them. Nuno Lau won more than 50 scientific awards in robotic competitions, conferences (best papers) and education. He has lectured courses at Ph.D. and M.Sc. levels on Intelligent Robotics, Distributed Artificial Intelligence, Computer Architecture, Programming, etc. Nuno Lau is the author of more than 160 publications in international conferences and journals. He was President of the Portuguese Robotics Society from 2015 to 2017.

Luís Paulo Reis is an Associate Professor at the Faculty of Engineering of the University of Porto in Portugal and Director of LIACC – Artificial Intelligence and Computer Science Laboratory at the same University. He is an IEEE Senior Member, was president of the Portuguese Society for Robotics, and is vice-president of the Portuguese Association for Artificial Intelligence. During the last 25 years, he has lectured courses on Artificial Intelligence, Intelligent Robotics, Multi-Agent Systems, Simulation and Modelling, Games and Interaction, Educational/Serious Games and Computer Programming. He was the principal investigator of more than 10 research projects in those areas. He won more than 50 scientific awards, including winning more than 15 RoboCup international competitions and best papers at conferences such as ICEIS, Robotica, IEEE ICARSC and ICAART. He supervised 20 Ph.D. and 102 M.Sc. theses to completion and is supervising 8 Ph.D. theses. He organized more than 50 international scientific events and belonged to the Program Committee of more than 250 scientific events. He is the author of more than 300 publications in international conferences and journals (indexed at SCOPUS or ISI Web of Knowledge).
