Sub-AVG: Overestimation reduction for cooperative multi-agent reinforcement learning

doi:10.1016/j.neucom.2021.12.039

Neurocomputing

Volume 474, 14 February 2022, Pages 94-106

https://doi.org/10.1016/j.neucom.2021.12.039 Get rights and content

Abstract

Decomposing the centralized joint action value(JAV) into per-agent individual action value(IAV) is attractive in cooperative multi-agent reinforcement learning(MARL). In such tasks, IAVs based on local observation can perform decentralized policies, and the JAV is used for end-to-end training through traditional reinforcement learning methods, especially through the Q-learning algorithm. However, the Q-learning-based method suffers from overestimation, in which the overestimated action values may result in a suboptimal policy. In this paper, we show that such overestimation can occur in the above Q-learning-based decomposition method. Our solution is Sub-AVG, which utilizes a lower update target by discarding the larger of previously learned IAVs and averaging the retained ones, thus eliminating the excessive overestimation errors. Experiments in the StarCraft Multi-Agent Challenge(SMAC) environment show that Sub-AVG can lead to lower JAV estimations and better-performing policies.

Introduction

Many real-world tasks can be modeled as cooperative multi-agent problems, in which agents work together to satisfy a common goal. Reinforcement learning(RL) holds considerable potential to deal with these tasks, such as distributed logistics [1], and network packet routing [2].

The paradigm of centralized training with decentralized execution(CTDE) has drawn much attention in the cooperative multi-agent reinforcement learning (MARL) [3], [4], [5], [6], [7], [8]. One approach to exploit the CTDE paradigm is making the joint action and the global state available to estimate a fully centralized joint action value(JAV) [3], [4], then uses the JAV to guide the decentralized policies to learn coordination. However, the dimension of joint action space grows exponentially with the number of agents.

To cope with such scalability, many algorithms decompose the centralized JAV into per-agent individual action value(IAV) [5], [6], [7], [8]. Specifically, each agent estimates an IAV only based on its local observation-action pair, and the IAVs of all agents are mixed to compose a JAV. Then, the JAV is used for end-to-end training through traditional RL methods, especially through the commonly used Q-learning algorithm [9], such as in VDN [5] and QMIX [6].

However, because of including a maximization step over action values, the Q-learning-based method tends to estimate action values higher than the real action value [10]. Although uniformly estimating the action values higher is an exploration technique used in both single-agent [11] and multi-agent [12] RL methods, it may result in a suboptimal policy [10], [13] and even a policy deterioration [14] if the overestimation is not uniform. Moreover, such overestimation can occur in practice when the estimated action values are inaccurate, despite the error source [13]. When it comes to the above decomposition method, if overestimation error occurs in each IAV and is mixed into the JAV, the problem may be more severe. SMIX( $λ$ ) [8] utilizes a Sarsa [15]-based method rather than the Q-learning-based method to avoid the maximization step. However, whether the Q-learning-based method should be abandoned in the decomposition method is an open question. Thereby, if overestimation occurs in the Q-learning-based decomposition method, how can we cope with the error and improve performance? This is the original motivation of our work.

In this paper, we show that the overestimation error does occur on the JAV, and will be harmful to the policy performance. To cope with the error, we propose a new approach named Sub-Average(Sub-AVG). Key to our method is the insight that Double DQN [13] utilizes a lower update target and efficiently reduces the overestimation in the single-agent RL methods, but there are two problems when extending it to the decomposition method: (1) It uses a negative bias to replace the positive one, which may introduce an underestimation error [16], and such errors may be mixed into the JAV; (2) The lower update target relies on the condition that the greedy actions determined by online network and target network are inconsistent(otherwise it is same with the standard DQN [17]), which may be difficult to satisfy when all agents share the highly generalized agent networks in MARL. Thus, Sub-AVG contracts the update target by discarding the larger of the multiple previously learned action values and averaging the retained ones. It aims to eliminate the excessive overestimation error, and obtains an overall lower update target that within the range greater than the real one, which is more conservative and safe. Experimental results¹ in the StarCraft Multi-Agent Challenge(SMAC) [18] prove the main contributions of this work: (1) The larger action values, which are discarded in Sub-AVG, are harmful to the policy performance. It reflects that the overestimation errors do occur in the decomposition method. (2) By eliminating the excessive overestimation error, Sub-AVG with the overall lower update target can lead to a lower JAV estimation and better policy. (3) Sub-AVG can be generalized and be beneficial to other cooperative MARL methods.

Section snippets

Related works

Our work is related to the following overestimation reduction methods.

Single-agent reinforcement learning. The damage of overestimation has been studied in the previous work [10]. Thus, Double Q-learning [19] introduces a double estimator to avoid the direct max operation over action values, then Double DQN [13] and TD3 [20] extend it to the deep RL. However, it may induce an underestimation error, thus the weighted double estimator [16] is proposed to balance the overestimation and

Deep Q Network

Deep Q Network(DQN) [17] uses a neural network with parameters $θ_{t}$ to estimate the action value $Q (s_{t}, μ_{t}; θ_{t})$ . Specifically, when the agent takes an action $μ_{t}$ in state $s_{t}$ , and transfers to next state $s_{t + 1}$ with a reward $r_{t}$ , a transition $(s_{t}, μ_{t}, r_{t}, s_{t + 1})$ is stored in a replay buffer [32]. During training, batches of transitions are sampled from the buffer to update the parameters $θ_{t}$ of online network. The loss function is: $L (θ_{t}) = {(y^{DQN} - Q (s_{t}, μ_{t}; θ_{t}))}^{2}$ with an update target: $y^{DQN} = r_{t} + γ \max_{μ_{t + 1}} Q (s_{t + 1}, μ_{t + 1}; θ_{t}^{-}),$ where

Overestimation error in decomposition MARL

In Theorem 1, we show that there are upward biases on the maximum IAV and JAV in VDN, in which VDN is one of the representative Q-learning-based decomposition method.

Theorem 1

Considering a single local observation $o_{t}^{i}$ of the observation history $τ^{i}$ ,assume that the real optimal JAV $Q_{jt}^{*} (o_{t}, μ_{t})$ can be decomposed into per-agent ideal optimal IAV $Q_{i}^{*} (o_{t}^{i}, μ_{t}^{i})$ as described in VDN, in which $Q_{jt}^{*} (o_{t}, μ_{t}) = \sum_{i = 1}^{N} Q_{i}^{*} (o_{t}^{i}, μ_{t}^{i})$ and $Q_{i}^{*} (o_{t}^{i}, μ_{t}^{i})$ is equal at $Q_{i}^{*} (o_{t}^{i}, μ_{t}^{i}) = V_{i}^{*} (o_{t}^{i})$ for some $V_{i}^{*} (o_{t}^{i})$ .Then assume that

Overestimation reduction method

Theorem 1 gives a positive lower bound of overestimation error in Q-learning-based MARL method, which is $\sum_{i = 1}^{N} γ \sqrt{C_{i} / (m_{i} - 1)}$ , where N is the number of agents, $m_{i}$ is the size of action space, and $C_{i} = (1 / m_{i}) \sum_{μ_{t}^{i}} {(Q_{i} (o_{t}^{i}, μ_{t}^{i}) - V_{i}^{*} (o_{t}^{i}))}^{2}$ . The only variable in the lower bound is the IAV. Thus, we can reduce the lower bound of overestimation error in Q-learning-based MARL methods by reducing the overestimated IAVs.

In this section, we propose an approach named Sub-AVG, which aims to obtain a lower update

Experiment

We conduct experiments respectively on a classic cooperative task named Switch Riddle [35] and a popular cooperative benchmark named StarCraft Multi-Agent Challenge(SMAC) [18].

Conclusion

In this paper, we show that overestimation can occur in the Q-learning-based decomposition method. To address this issue, we present an extension method named Sub-AVG that aims to obtain a lower update target by discarding the larger action values, which can eliminate the excessive overestimation error. Experimental results show that Sub-AVG can obtain better-performing policies during reducing the overestimation on action values by the proposed lower update target. Besides, by comparison with

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Haolin Wu received the B.S. degree in electrical engineering and automation from Chongqing University of Posts and Telecommunications, Chongqing, China, in 2014 and the M.S. degree in control theory and engineering from Sichuan University of Science and Engineering, Yibin, China in 2018. He is currently pursuing the Ph.D. degree in computer science and technology at Sichuan University, Chengdu, China. His research interests includes the sample efficiency and algorithm performance improvement of

References (35)

Wang Ying et al.
Multi-agent framework for third party logistics in e-commerce
Expert Systems With Applications
(2005)
Dayong Ye et al.
A multi-agent framework for packet routing in wireless sensor networks
Sensors
(2015)
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual...
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed...
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot,...
Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson....
Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with...
Chao Wen
Xinghu Yao, Yuhui Wang, and Xiaoyang Tan. Smix $λ$ : Enhancing centralized value functions for cooperative multi-agent reinforcement learning
Christopher J.C.H. Watkins et al.
Q-learning
Machine Learning
(1992)
Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings...

Leslie Pack Kaelbling et al.

Reinforcement learning: a survey

Journal of Artificial Intelligence Research

(1996)

Martin Lauer et al.

An algorithm for distributed reinforcement learning in cooperative multi-agent systems

Hado Van Hasselt et al.

Deep reinforcement learning with double q-learning

Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-dqn: Variance reduction and stabilization for deep reinforcement...

Richard S Sutton et al.

Reinforcement learning: An introduction

(2018)

Zongzhang Zhang, Zhiyuan Pan, and Mykel J Kochenderfer. Weighted double q-learning. In IJCAI, pages 3455–3461,...

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin...

Cited by (8)

Q-learning with heterogeneous update strategy
2024, Information Sciences
A variety of algorithms has been proposed to mitigate the overestimation bias of Q-learning. These algorithms reduce the estimation of maximum Q-value, i.e., homogeneous update. As a result, some of these algorithms such as Double Q-learning suffer from the underestimation bias. Different from these algorithms, this paper proposes a heterogeneous update idea. It aims to enlarge the normalized gap between Q-value corresponding to the optimal action and that corresponding to the other actions. Based on heterogeneous update, we design HetUp Q-learning. More specifically, HetUp Q-learning increases the normalized gap by overestimating Q-value corresponding to the optimal action and underestimating Q-value corresponding to the other actions. However, one limitation is that our HetUp Q-learning takes the optimal action as input to decide whether a state-action pair should be overestimated or underestimated. To address this challenge, we apply a softmax strategy to estimate the optimal action and obtain HetUpSoft Q-learning. We also extend HetUpSoft Q-learning to HetUpSoft DQN for high-dimensional environments. Extensive experiment results show that our proposed methods outperform SOTA baselines drastically in different settings. In particular, HetUpSoft DQN improves the average score per episode over SOTA baselines by at least 55.49% and 32.26% in the Pixelcopter and Breakout environments, respectively.
Common belief multi-agent reinforcement learning based on variational recurrent models
2022, Neurocomputing
Citation Excerpt :
QMIX [8] used a mixing network to factorize the value functions. Other prominent progress includes but is not limited to studies such as [9–14]. However, all these methods only use centralised critic to coordinate during training, and lack a coordination mechanism among agents during execution.
The tacit cooperation among human teams benefits from the fact that consensus can be reached on a task through common belief. Similar to human social groups, agents in distributed learning systems can also rely on common belief to achieve cooperation under the condition of limited communication. In this paper, we show the role of common belief among agents in completing cooperative tasks, by proposing the Common Belief Multi-Agent (CBMA) reinforcement learning method. CBMA is a novel value-based method that infers the belief between agents with a variational model and models the environment with a variational recurrent neural network. We validate CBMA on two grid-world games as well as the StarCraft II micromanagement benchmark. Experimental results show that the learned common belief by CBMA can improve performance in both discrete and continuous state settings.
A double Actor-Critic learning system embedding improved Monte Carlo tree search
2024, Neural Computing and Applications
DM-DQN: Dueling Munchausen deep Q network for robot path planning
2023, Complex and Intelligent Systems
Multi-Agent Reinforcement Learning Clustering Algorithm Based on Silhouette Coefficient
2023, SSRN
Reinforcement learning method for target hunting control of multi-robot systems with obstacles
2022, International Journal of Intelligent Systems

View all citing articles on Scopus

Jianwei Zhang received his Ph.D. degree from Sichuan University, Chengdu, China, in 2008. He has taught and conducted research at Sichuan University since 1993. He has published more than 50 articles. His research interests include air traffic management, and intelligent image analysis and processing. Dr. Zhang received the National Science and Technology Progress Award in China.

Zhuang Wang received the B.S. degree in information engineering and the M.S. degree in optical engineering from Tianjin University, Tianjin, China, in 2009 and 2012. He is currently pursuing the Ph.D. degree in software engineering at Sichuan University, Chengdu, Sichuan, China. His research interest includes artificial intelligence in military, deep reinforcement learning, and air combat theory.

Yi Lin received his Ph.D. degree from Sichuan University, Chengdu, China, in 2019. He currently works as an Associate Professor with the College of Computer Science, Sichuan University. He was a visiting scholar at University of Wisconsin-Madison, Madison, WI, USA. His research interests include air traffic flow management and planning, machine learning, and deep-learning-based air traffic management applications.

Hui Li received the B.S. degree in Computer Science from Chengdu University of Science & Technology, Chengdu China in 1991, M.S. degree from Simon Fraser University, Canada in 1997, and Ph.D. degree in Computer Science from Sichuan University, China in 2007. From 1991 to 1994, he was a software engineer in Sichuan University; from 1997-1998, he was a senior develop in Nortel, Ottawa, Canada; since 1999, he has been working in College of Computer Science as lecturer, vice professor and professor sequentially. His research interests includes virtual reality, command and control simulation, and artificial intelligence. He has been awarded National Nature Science Foundation, National Science & Technology Foundation 3 times. He also conducted many simulation and smart system application project in his domain. He has published more than 20 papers. Dr. Li was awarded the National Science& Technology Advancing Prize once. He is also member of China Society of Image & Graphics.

View full text

Sub-AVG: Overestimation reduction for cooperative multi-agent reinforcement learning

Abstract

Introduction

Section snippets

Related works

Deep Q Network

Overestimation error in decomposition MARL

Overestimation reduction method

Experiment

Conclusion

Declaration of Competing Interest

Expert Systems With Applications

A multi-agent framework for packet routing in wireless sensor networks

Sensors

Xinghu Yao, Yuhui Wang, and Xiaoyang Tan. Smixλ: Enhancing centralized value functions for cooperative multi-agent reinforcement learning

Q-learning

Machine Learning

Reinforcement learning: a survey

Journal of Artificial Intelligence Research

An algorithm for distributed reinforcement learning in cooperative multi-agent systems

Deep reinforcement learning with double q-learning

Reinforcement learning: An introduction

Xinghu Yao, Yuhui Wang, and Xiaoyang Tan. Smix $λ$ : Enhancing centralized value functions for cooperative multi-agent reinforcement learning