Neurocomputing

Volume 471, 30 January 2022, Pages 79-93

Multi actor hierarchical attention critic with RNN-based feature extraction

https://doi.org/10.1016/j.neucom.2021.10.093

Abstract

Deep reinforcement learning has made significant progress on multi-agent tasks in recent years. However, most previous studies focus on fully cooperative tasks, and their methods do not perform well in mixed tasks. In mixed tasks, an agent must comprehensively weigh the information provided by its friends and enemies to learn its strategy, and its strategy is sensitive to the received information. Additionally, in the actor-critic framework the input space of the critic network grows rapidly with the number of agents, so it is essential to learn efficient information representations that capture the important features. To this end, we present an approach that learns information representations with an attention mechanism. Our approach adopts the framework of centralized training and decentralized execution. We apply a multi-head hierarchical attention mechanism to the centrally computed critics, so that the critics can process the received information more accurately and assist the actors in choosing better actions. The hierarchical attention critic adopts a bi-level attention structure composed of an agent level and a group level, designed to assign different weights to friends' and enemies' information and then summarize them at each timestep. It achieves high efficiency and scalability in mixed tasks. Furthermore, we use feature extraction based on a recurrent neural network to encode the state-action sequence information of each agent. Experimental results show that our approach is not only applicable to cooperative environments but also performs better in mixed environments; in particular, in the predator-prey task the reward obtained by our method is twice that of the baselines.

Introduction

Multi-agent reinforcement learning (MARL) is an important branch of reinforcement learning (RL). It has been studied for a long time and applied in a variety of settings [1], [2]. According to the relationship between the agents, multi-agent tasks can be divided into three categories: fully cooperative, fully competitive, and mixed settings [3]. The mixed setting covers a series of scenarios that include both confrontation and cooperation, where multiple groups (usually two) are in a competitive relationship while the agents within the same group cooperate. MARL can be applied to learn cooperation between cooperative agents [4], the optimal strategy in competitive settings [5], etc. However, MARL has long been restricted to simple environments and tabular methods, which are difficult to apply to the real world. Recently, deep reinforcement learning (DRL) has made exciting progress in many domains, including Atari-2600 games [6] and Go [7]. The success of deep Q-Networks (DQN) [6] has promoted the combination of deep learning and MARL, forming the emerging area of multi-agent deep reinforcement learning (MADRL) [8], [9], [10]. MADRL enables agents to operate in high-dimensional state and action spaces [11], [12], [13].

To learn effectively in multi-agent environments, agents must learn not only the dynamic characteristics of the environment but also the dynamic characteristics of the other agents.

The simplest approach in MADRL is to use single-agent algorithms in the multi-agent setting. Each agent learns its own strategy, treats the other agents as part of the environment, and maximizes its own reward. However, these independent learners struggle to converge because the environment appears dynamic and non-stationary [14], [15], which violates the Markov property (i.e., that future dynamics, transitions, and rewards depend only on the current state). These independent learning approaches are more scalable but suffer from non-stationarity issues [8]. Another idealized approach is to model all agents collectively as a single centralized unit [16], whose action space is the joint action space of all agents. This approach is not scalable because the size of the joint action space grows exponentially with the number of agents. Moreover, since the centralized control unit must collect information from and distribute actions to each agent, this method requires highly effective communication during execution, which is difficult to guarantee in real applications.
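
To make the scalability issue concrete, here is a standard counting argument (an illustrative example of ours; the numbers are not taken from the paper). If agent i chooses from an action set A_i, the centralized unit must choose from the Cartesian product of all individual action sets:

\[
|A_{\text{joint}}| \;=\; \prod_{i=1}^{N} |A_i| \;=\; |A|^{N} \quad \text{when all agents share the same action set } A .
\]

For example, 5 agents with 10 actions each already yield 10^5 = 100,000 joint actions, whereas each decentralized actor still selects among only 10.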

Recent studies of MADRL algorithms try to combine the advantages of these two approaches. The majority of these algorithms adopt the framework of centralized training and decentralized execution [17], [18], [13], [19]. This framework contains decentralized actors and centralized critics: the actor represents the policy, and the critic learns the value function [20]. In this framework, how the critics process the information of all agents is crucial for deriving good decentralized policies.
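
To illustrate this framework, the following is a minimal PyTorch-style sketch under our own simplifying assumptions (class and parameter names such as Actor, CentralizedCritic, and total_obs_dim are ours, not the paper's): each agent owns a decentralized actor that sees only its local observation, while a centralized critic consumes the observations and actions of all agents during training.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: maps one agent's local observation to action logits."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):                  # obs: (batch, obs_dim)
        return self.net(obs)                 # action logits

class CentralizedCritic(nn.Module):
    """Centralized critic: scores an action given every agent's obs and action."""
    def __init__(self, total_obs_dim, total_act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(total_obs_dim + total_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_acts):    # concatenated over agents
        return self.net(torch.cat([all_obs, all_acts], dim=-1))  # Q-value
```

During execution only the actors are needed; the critic, and hence the global information, is used exclusively at training time, which is what makes execution decentralized. Note that the plain concatenation above is exactly the input-space bottleneck that the hierarchical attention critic described later is designed to avoid.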

Although most recent works have achieved good results, they focus on fully cooperative or competitive tasks in which multiple agents share rewards. In mixed tasks, two important challenges must be overcome. The first is to train sensible agents for mixed settings. Agents are sensitive to observations, and the interaction between friends and enemies may lead to multiple equilibria during the game; the choice of equilibrium determines the performance of the agents [3]. It is therefore important for centralized critics to properly process the received observations and actions and produce accurate Q-values that guide the decentralized actors' choice of equilibrium. The second challenge is that the sequence information provided by heterogeneous agents in mixed tasks needs to be pre-processed. To obtain better features in deep neural networks (DNN), prior work proposed an attention critic [19] that uses a multi-layer perceptron (MLP) to encode the received information.

We look to natural language processing (NLP) for inspiration to address the above challenges. In NLP, hierarchical attention networks (HAN) [21] process information with both word-level and sentence-level attention in order to better capture document information: word embeddings and contextual connections constitute the information of the sentences, and the prediction for the document depends on the information expressed by the key sentences. Taking the insights of these prior works into consideration, we propose a novel model called the actor hierarchical attention critic (AHAC) to overcome the challenges mentioned above, and we show that AHAC performs better in mixed settings. The main idea is to learn accurate centralized hierarchical attention critics. The inspiration for AHAC comes from confrontations in the real world, where multiple groups compete with each other while the agents within the same group cooperate, and each agent must consider both powerful friends and threatening enemies. An overview of AHAC is shown in Fig. 1.

We mark the agents according to their types and goals in the environment. Since different agents may have different observation and action spaces in mixed tasks, we project each agent's observation and action information into a fixed-size feature space to retain the important information. The recurrent neural network (RNN) is a class of neural networks that exploits the sequential nature of data by sharing model parameters across timesteps [22]; we use an RNN to encode each state-action tuple. To address the problem that the critic network's input space grows linearly with the number of agents, we propose a multi-head hierarchical attention (MHA) mechanism that combines multi-head attention [23] with hierarchical attention. MHA is a bi-level attention structure: the first level computes the contribution of all agents, and the second level compresses these contributions into a single vector. More specifically, each agent first exchanges locally perceived information with the other agents in its group to compute their contributions; it then weighs the influence of the different groups on its behavior according to each group's contribution. In this way, an agent can better judge the threats from and the help offered by the other agents in the environment. Moreover, since MHA converts the contributions of all agents into fixed-size feature vectors, computational efficiency is greatly improved.
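
The following is a minimal sketch of this bi-level idea (our own illustrative PyTorch code with hypothetical names such as BiLevelAttention and group_feats; the paper's actual network details may differ): agent-level attention pools the encoded features of the agents within each group, and group-level attention then pools the resulting group summaries into one fixed-size context vector from agent i's point of view.

```python
import torch
import torch.nn as nn

class BiLevelAttention(nn.Module):
    """Agent-level attention within each group, then group-level attention
    across the group summaries, producing one fixed-size context vector."""
    def __init__(self, feat_dim, n_heads=4):
        super().__init__()
        self.agent_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.group_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.out = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())

    def forward(self, query_i, group_feats):
        # query_i:     (batch, 1, feat_dim)  encoded features of agent i
        # group_feats: list of (batch, n_g, feat_dim), one tensor per group
        group_summaries = []
        for feats in group_feats:
            # Agent-level: how much does each member of this group matter to agent i?
            summary, _ = self.agent_attn(query_i, feats, feats)
            group_summaries.append(summary)                    # (batch, 1, feat_dim)
        groups = torch.cat(group_summaries, dim=1)              # (batch, n_groups, feat_dim)
        # Group-level: how much does each group matter to agent i?
        context, _ = self.group_attn(query_i, groups, groups)   # (batch, 1, feat_dim)
        return self.out(context.squeeze(1))                     # fixed-size vector
```

Because the output has a fixed size regardless of how many agents belong to each group, the critic's input no longer grows with the number of agents.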

Our main contributions can be summarized as follows:

  • RNN-based feature extraction. We use a feature extraction module based on an RNN to encode the received information (see the sketch after this list). It converts each agent's raw data (i.e., observation and action information) into numerical features that are convenient for the critic network to compute on. Furthermore, it solves the problem of the unequal length of information received by the critics in heterogeneous multi-agent problems.

  • Weighting information with MHA. We use MHA to adaptively extract important information from multiple agents. At each timestep, it computes the importance distribution within each group in the agent-level attention and the importance distribution between groups in the group-level attention. In addition, MHA concatenates the weighted information of multiple feature subspaces, followed by a feed-forward layer with a non-linear activation function, so it can better integrate the contributions of the other agents from all attention heads into a single vector.

  • Combining with multi-agent actor-critic. By combining RNN-based feature extraction and MHA with multi-agent actor-critic, we propose a novel algorithm, AHAC. AHAC solves the problem that the input space of the critic network is sensitive to the growth of the number of agents, and it has better scalability in multi-agent tasks.
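
As a companion to the first item above, here is a minimal sketch of RNN-based feature extraction (again our own illustrative PyTorch code under assumed tensor shapes; the names RNNFeatureExtractor, feat_dim, etc. are hypothetical and not from the paper): a per-agent projection maps possibly different-sized observation-action pairs into a common feature space, and a GRU encodes the resulting sequence into one fixed-size vector per agent.

```python
import torch
import torch.nn as nn

class RNNFeatureExtractor(nn.Module):
    """Encodes an agent's observation-action sequence into a fixed-size feature."""
    def __init__(self, obs_dim, act_dim, feat_dim=64):
        super().__init__()
        # A per-agent projection handles heterogeneous observation/action sizes.
        self.proj = nn.Linear(obs_dim + act_dim, feat_dim)
        self.rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, T, obs_dim), act_seq: (batch, T, act_dim)
        x = torch.relu(self.proj(torch.cat([obs_seq, act_seq], dim=-1)))
        _, h = self.rnn(x)           # h: (1, batch, feat_dim), final hidden state
        return h.squeeze(0)          # (batch, feat_dim) fixed-size feature

# Hypothetical usage: one extractor per agent type, so agents with different
# observation/action sizes all map into the same feat_dim feature space.
extractor = RNNFeatureExtractor(obs_dim=12, act_dim=5, feat_dim=64)
feat = extractor(torch.zeros(8, 10, 12), torch.zeros(8, 10, 5))  # -> (8, 64)
```

In a centralized critic, such per-agent features would then be fed to the bi-level attention sketched earlier before the Q-value head.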

We evaluate AHAC in the multi-agent particle environment (MPE) [24]. MPE gathers a list of navigation tasks and permits customized tasks. We compare AHAC with state-of-the-art baselines and evaluate them on different tasks designed in previous works [18], [19], which cover common multi-agent tasks. We show that our approach is suitable for fully cooperative, fully competitive, and mixed tasks, and in particular that it achieves outstanding performance in mixed settings.

The rest of the paper is organized as follows. We discuss related works and preliminaries in Section 2 and describe our approach in Section 3. Experimental results are shown in Section 4. Finally, we conclude our work in Section 5.

Since many abbreviations appear in this article, we summarize them all in Table 1.


Markov games

In this work, we consider the framework of Markov games [5], which is a multi-agent extension of Markov decision processes (MDPs). We denote a Markov game by the tuple $\langle N, S, A, O, T, R \rangle$. $N$ is the number of agents, $S$ is the set of states, $A = \{A_1, \ldots, A_N\}$ is the set of action sets of all agents, and $O = \{O_1, \ldots, O_N\}$ is the set of observation sets of all agents. Given the current state and the action of each agent $i$, $T: S \times A_1 \times \cdots \times A_N \rightarrow P(S)$ is the state transition function, which defines the probability distribution over possible next

Our approach

The architecture of AHAC is shown in Fig. 1. AHAC uses the multi-agent actor-critic framework and is mainly composed of three core mechanisms: RNN-based feature extraction, MHA, and multi-agent actor-critic updating.

First, we label agents according to their types and tasks in the environment. Then, we use RNN-based feature extraction to project each agent's observation and action into a fixed-size feature space, because different agents' information in mixed tasks may have different

Simulation

We first introduce experimental settings and baselines and then analyze the experimental results.

Conclusion

In our work, we propose an approach that applies RNN-based feature extraction and multi-head hierarchical attention to centralized training and decentralized execution. The key idea is to use multi-head hierarchical attention to weight all received information and compress it into a fixed-length vector that retains the important information. This solves the problem that the size of the critic network's input space is sensitive to the number of agents. Therefore, the critic network can make a more accurate evaluation for

Funding

This work was supported by the National Natural Science Foundation of China (No. 61803375).

CRediT authorship contribution statement

Dianxi Shi: Funding acquisition, Writing - review & editing. Chenran Zhao: Writing - original draft, Visualization, Investigation. Yajie Wang: Writing - original draft, Investigation, Conceptualization. Huanhuan Yang: Methodology, Formal analysis. Gongju Wang: Project administration. Hao Jiang: Visualization. Shaowu Yang: Supervision. Yongjun Zhang: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (40)

  • M. Tan, Multi-agent reinforcement learning: Independent vs. cooperative agents
  • L. Bu et al., A comprehensive survey of multiagent reinforcement learning, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) (2008)
  • E. Yang, D. Gu, Multiagent reinforcement learning for multi-robot systems: A survey, Tech. rep...
  • L. Buşoniu, R. Babuška, B. De Schutter, Multi-agent reinforcement learning: An overview, in: Innovations in multi-agent...
  • M.L. Littman, Markov games as a framework for multi-agent reinforcement learning, in: Machine learning proceedings...
  • V. Mnih et al., Human-level control through deep reinforcement learning, Nature (2015)
  • D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V....
  • P. Hernandez-Leal et al., A survey and critique of multiagent deep reinforcement learning, Autonomous Agents and Multi-Agent Systems (2019)
  • G. Papoudakis, F. Christianos, A. Rahman, S.V. Albrecht, Dealing with non-stationarity in multi-agent deep...
  • T.T. Nguyen, N.D. Nguyen, S. Nahavandi, Deep reinforcement learning for multiagent systems: A review of challenges,...
  • A. Tampuu et al., Multiagent cooperation and competition with deep reinforcement learning, PLoS ONE (2017)
  • J.K. Gupta, M. Egorov, M. Kochenderfer, Cooperative multi-agent control using deep reinforcement learning, in:...
  • J.N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, S. Whiteson, Counterfactual multi-agent policy gradients, in:...
  • K. Tuyls et al., Multiagent learning: Basics, challenges, and prospects, AI Magazine (2012)
  • A. Nowé, P. Vrancx, Y.-M. De Hauwere, Game theory and multi-agent reinforcement learning, in: Reinforcement Learning,...
  • L. Buşoniu, R. Babuška, B. De Schutter, Multi-agent Reinforcement Learning: An Overview, Springer, Berlin...
  • J. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P.H. Torr, P. Kohli, S. Whiteson, Stabilising experience replay for...
  • R. Lowe et al., Multi-agent actor-critic for mixed cooperative-competitive environments, Advances in Neural Information Processing Systems (2017)
  • S. Iqbal, F. Sha, Actor-attention-critic for multi-agent reinforcement learning, arXiv preprint...
  • V.R. Konda, J.N. Tsitsiklis, Actor-critic algorithms, in: Advances in Neural Information Processing Systems, 2000, pp....

    Dianxi Shi is a Researcher and the Deputy Director of the Artificial Intelligence Research Center of the National Innovation Institute of Defense Technology. He received the B.S., M.S. and Ph.D. degrees in computer science from the National University of Defense Technology, Changsha, China, in 1989, 1996 and 2000. His research interests include distributed object middleware technology, software component technology, adaptive software technology and intelligent unmanned cluster system software architecture.

    Chenran Zhao received the B.S. degree in computer science from Northeastern University, Shenyang, China, in 2019. She is currently pursuing the master’s degree with the National University of Defense Technology. Her current research interests include multi-agent reinforcement learning algorithms and attention mechanism.

    Yajie Wang received the B.S. degree in computer science from Beihang University, Beijing, China, in 2018. She is currently pursuing the master’s degree with the National University of Defense Technology. Her current research interests include multi-agent reinforcement learning algorithms and attention mechanism.

    Huanhuan Yang received the M.S. degree in computer science from Henan Polytechnic University, Jiaozuo, China, in 2019. She is currently pursuing the Ph.D. degree with the National University of Defense Technology. Her current research interests include reinforcement learning and machine learning.

    Gongju Wang received the B.E. degree in computer science from Beijing Jiaotong University, Beijing, China, in 2017. He is pursuing the master’s degree with the Artificial Intelligence Research Center, National Innovation Institute of Defense Technology, Beijing 100166, China. His current interests include multi-agent reinforcement learning and hierarchical reinforcement learning.

    Hao Jiang received the B.S. degree in computer science from Sichuan University, Chengdu, Sichuan, China, in 2018. He is currently pursuing the master’s degree with the National University of Defense Technology. His current research interests include multi-agent reinforcement learning algorithms and graph attention network.

    Chao Xue received the B.S. and Ph.D. degrees from the Department of Computer Science and Technology, Tsinghua University, in 2011 and 2017. His research interests include the modeling and evaluation of networked computing systems and applications, with a particular interest in cloud computing, reinforcement learning, parallel learning, and cloud-native machine learning systems.

    Shaowu Yang was born in 1986. He holds a Ph.D. and is an associate professor. His main research interests include simultaneous localization and mapping (SLAM), semantic analysis of three-dimensional environments, and swarm intelligent robot operating systems.

    Yongjun Zhang was born in 1966. He holds a Ph.D. and is a professor. He has participated in the National High Technology Research and Development Program of China and the National Natural Science Foundation of China, and has published more than 20 papers. His main research interests include artificial intelligence, multi-agent cooperation, machine learning, and feature recognition.
