Sub-AVG: Overestimation reduction for cooperative multi-agent reinforcement learning
Introduction
Many real-world tasks can be modeled as cooperative multi-agent problems, in which agents work together to achieve a common goal. Reinforcement learning (RL) holds considerable potential for such tasks, including distributed logistics [1] and network packet routing [2].
The paradigm of centralized training with decentralized execution (CTDE) has drawn much attention in cooperative multi-agent reinforcement learning (MARL) [3], [4], [5], [6], [7], [8]. One approach to exploiting the CTDE paradigm is to make the joint action and the global state available for estimating a fully centralized joint action value (JAV) [3], [4], which is then used to guide the decentralized policies toward coordinated behavior. However, the dimension of the joint action space grows exponentially with the number of agents.
To cope with this scalability problem, many algorithms decompose the centralized JAV into per-agent individual action values (IAVs) [5], [6], [7], [8]. Specifically, each agent estimates an IAV based only on its local observation-action pair, and the IAVs of all agents are mixed to compose a JAV. The JAV is then trained end-to-end with traditional RL methods, most commonly the Q-learning algorithm [9], as in VDN [5] and QMIX [6].
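To make the decomposition concrete, the following is a minimal sketch of VDN-style additive mixing, assuming small per-agent MLPs; the names AgentQNet and vdn_joint_q and all dimensions are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AgentQNet(nn.Module):
    """Per-agent IAV network: maps a local observation to one Q-value per action."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def vdn_joint_q(agent_nets, observations, actions):
    """Additive mixing: the JAV is the sum of the chosen per-agent IAVs."""
    chosen = []
    for net, obs, act in zip(agent_nets, observations, actions):
        q = net(obs)                                  # [batch, n_actions]
        chosen.append(q.gather(1, act.unsqueeze(1)))  # IAV of the taken action
    return torch.stack(chosen, dim=0).sum(dim=0)      # [batch, 1] joint value
```

Because the JAV is a simple sum, any bias in an individual IAV passes directly into the joint value.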
However, because it includes a maximization step over action values, the Q-learning-based method tends to estimate action values higher than the real ones [10]. Although uniformly raising the estimated action values is an exploration technique used in both single-agent [11] and multi-agent [12] RL methods, non-uniform overestimation may result in a suboptimal policy [10], [13] and even policy deterioration [14]. Moreover, such overestimation occurs in practice whenever the estimated action values are inaccurate, regardless of the error source [13]. In the decomposition methods above, if an overestimation error occurs in each IAV and is mixed into the JAV, the problem may be even more severe. SMIX(λ) [8] avoids the maximization step by using a Sarsa-based [15] method rather than a Q-learning-based one. However, whether the Q-learning-based method should be abandoned in decomposition methods is an open question. If overestimation occurs in the Q-learning-based decomposition method, how can we cope with the error and improve performance? This is the original motivation of our work.
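A small simulation, under the common assumption of i.i.d. uniform estimation noise, illustrates how each agent's max over noisy IAVs is biased upward even when the true values are identical, and how additive mixing compounds the bias across agents; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, eps, trials = 5, 6, 0.5, 100_000   # agents, actions per agent, noise half-width

# True IAVs are all zero, so the true greedy JAV is 0 for any action choice.
# Each agent's estimates carry independent uniform noise in [-eps, eps].
noise = rng.uniform(-eps, eps, size=(trials, N, M))
per_agent_max = noise.max(axis=2)        # each agent maximizes its own noisy IAV
jav_bias = per_agent_max.sum(axis=1)     # additive mixing sums the per-agent maxima

print(f"mean per-agent bias: {per_agent_max.mean():.4f}")   # ~ eps*(M-1)/(M+1)
print(f"mean JAV bias:       {jav_bias.mean():.4f}")        # ~ N times larger
```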
In this paper, we show that overestimation error does occur on the JAV and is harmful to policy performance. To cope with the error, we propose a new approach named Sub-Average (Sub-AVG). Key to our method is the insight that Double DQN [13] uses a lower update target and efficiently reduces overestimation in single-agent RL, but two problems arise when extending it to the decomposition method: (1) it replaces the positive bias with a negative one, which may introduce an underestimation error [16] that can then be mixed into the JAV; (2) its lower update target relies on the greedy actions of the online and target networks being inconsistent (otherwise it coincides with standard DQN [17]), a condition that may be difficult to satisfy when all agents share highly generalized agent networks in MARL. Sub-AVG therefore contracts the update target by discarding the larger of the multiple previously learned action values and averaging the retained ones. This eliminates the excessive overestimation error and yields an overall lower update target that remains above the real value, which is more conservative and safe. Experimental results on the StarCraft Multi-Agent Challenge (SMAC) [18] support the main contributions of this work: (1) the larger action values discarded by Sub-AVG are harmful to policy performance, confirming that overestimation errors do occur in the decomposition method; (2) by eliminating the excessive overestimation error, Sub-AVG's overall lower update target leads to a lower JAV estimation and a better policy; (3) Sub-AVG generalizes to, and benefits, other cooperative MARL methods.
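Problem (2) above can be seen in a few lines: with illustrative values, when the online and target networks agree on the greedy action, the Double DQN target collapses to the standard DQN target and provides no reduction.

```python
import numpy as np

gamma, reward = 0.99, 1.0
q_online = np.array([0.8, 1.2, 0.5])   # online-network values at the next state
q_target = np.array([0.7, 1.5, 0.6])   # target-network values at the next state

dqn_target = reward + gamma * q_target.max()
# Double DQN: select the action with the online network, evaluate with the target network.
a_star = q_online.argmax()
double_dqn_target = reward + gamma * q_target[a_star]

# Here both networks pick action 1, so the two targets coincide (2.485 each);
# the Double DQN target is only lower when the greedy actions disagree.
print(dqn_target, double_dqn_target)
```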
Section snippets
Related works
Our work is related to the following overestimation reduction methods.
Single-agent reinforcement learning. The damage caused by overestimation has been studied in previous work [10]. Double Q-learning [19] therefore introduces a double estimator to avoid the direct max operation over action values, and Double DQN [13] and TD3 [20] extend it to deep RL. However, the double estimator may induce an underestimation error, so the weighted double estimator [16] was proposed to balance the overestimation and…
Deep Q Network
Deep Q Network (DQN) [17] uses a neural network with parameters $\theta$ to estimate the action value $Q(s, a; \theta)$. Specifically, when the agent takes an action $a$ in state $s$ and transfers to the next state $s'$ with a reward $r$, a transition $(s, a, r, s')$ is stored in a replay buffer [32]. During training, batches of transitions are sampled from the buffer to update the parameters of the online network. The loss function is
$$L(\theta) = \mathbb{E}\left[\left(y^{\mathrm{DQN}} - Q(s, a; \theta)\right)^2\right],$$
with an update target
$$y^{\mathrm{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^-),$$
where $\theta^-$ denotes the parameters of the periodically updated target network and $\gamma$ is the discount factor.
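As a concrete illustration, here is a minimal sketch of this loss in PyTorch, assuming a replay-sampled batch of tensors; the function name and batch layout are our own, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma: float = 0.99):
    """Standard DQN loss: regress Q(s,a) toward r + gamma * max_a' Q_target(s',a')."""
    s, a, r, s_next, done = batch                  # tensors sampled from the replay buffer
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                          # the target is treated as a constant
        y = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```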
Overestimation error in decomposition MARL
In Theorem 1, we show that there are upward biases on the maximum IAV and on the JAV in VDN, a representative Q-learning-based decomposition method.

Theorem 1. Considering a single local observation of the observation history $\tau$, assume that the real optimal JAV can be decomposed into per-agent ideal optimal IAVs as described in VDN, i.e., $Q^*_{tot} = \sum_{i=1}^{N} Q^*_i$, in which … is equal at … for some … Then assume that…
Overestimation reduction method
Theorem 1 gives a positive lower bound on the overestimation error in the Q-learning-based MARL method, namely $N\epsilon\frac{M-1}{M+1}$, where $N$ is the number of agents, $M$ is the size of the action space, and $\epsilon$ bounds the estimation error of each IAV. The only quantity in the lower bound that learning can influence is the IAV estimation error. Thus, we can reduce the lower bound of the overestimation error in Q-learning-based MARL methods by reducing the overestimated IAVs.
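The $(M-1)/(M+1)$ factor comes from the expected maximum of uniform noise; a short derivation under the i.i.d. uniform-error assumption of Thrun and Schwartz [10] (our reconstruction, for illustration):

```latex
% M i.i.d. IAV errors e_a ~ U[-\epsilon, \epsilon]; the CDF of their maximum is
% P(\max_a e_a \le x) = \left(\tfrac{x+\epsilon}{2\epsilon}\right)^{M}, so
\mathbb{E}\Big[\max_{a} e_a\Big]
  = \int_{-\epsilon}^{\epsilon} x \,\frac{M (x+\epsilon)^{M-1}}{(2\epsilon)^{M}}\, dx
  = \epsilon\,\frac{M-1}{M+1}.
% Additive mixing sums this bias over N agents, giving N \epsilon (M-1)/(M+1).
```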
In this section, we propose an approach named Sub-AVG, which aims to obtain a lower update target by discarding the larger of the multiple previously learned action values and averaging the retained ones.
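A minimal sketch of such a target computation follows, assuming K snapshots of previously learned target networks are kept; treating "larger" as "above the per-action mean of the K estimates" is our illustrative reading of the discard rule, not necessarily the paper's exact criterion.

```python
import numpy as np

def sub_avg_target(q_snapshots, reward, gamma=0.99):
    """Sub-AVG-style update target (sketch).

    q_snapshots: [K, n_actions] next-state action values from the K most
    recently learned target networks. Values above the per-action mean are
    treated as excessively overestimated and discarded; the rest are averaged.
    """
    q = np.asarray(q_snapshots)                  # [K, n_actions]
    mean_per_action = q.mean(axis=0)             # consensus estimate per action
    keep = q <= mean_per_action                  # discard the larger estimates
    q_retained = np.where(keep, q, 0.0).sum(axis=0) / keep.sum(axis=0)
    return reward + gamma * q_retained.max()     # still Q-learning, lower target
```

Since the smallest of the K estimates always survives the filter, the retained average, and hence the target, can only be lower than or equal to the plain K-network average.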
Experiment
We conduct experiments on a classic cooperative task named Switch Riddle [35] and on a popular cooperative benchmark named the StarCraft Multi-Agent Challenge (SMAC) [18].
Conclusion
In this paper, we show that overestimation can occur in the Q-learning-based decomposition method. To address this issue, we present an extension method named Sub-AVG, which aims to obtain a lower update target by discarding the larger action values, thereby eliminating the excessive overestimation error. Experimental results show that Sub-AVG obtains better-performing policies while reducing the overestimation of action values through the proposed lower update target. Besides, by comparison with…
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (35)
- et al., Multi-agent framework for third party logistics in e-commerce, Expert Systems With Applications (2005).
- et al., A multi-agent framework for packet routing in wireless sensor networks, Sensors (2015).
- Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual…
- Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed…
- Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, …
- Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. …
- Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with…
- Xinghu Yao, Yuhui Wang, and Xiaoyang Tan. SMIX(λ): Enhancing centralized value functions for cooperative multi-agent reinforcement learning.
- et al., Q-learning, Machine Learning (1992).
- Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings…
- Reinforcement learning: a survey, Journal of Artificial Intelligence Research.
- An algorithm for distributed reinforcement learning in cooperative multi-agent systems.
- Deep reinforcement learning with double q-learning.
- Reinforcement learning: An introduction.
Haolin Wu received the B.S. degree in electrical engineering and automation from Chongqing University of Posts and Telecommunications, Chongqing, China, in 2014 and the M.S. degree in control theory and engineering from Sichuan University of Science and Engineering, Yibin, China, in 2018. He is currently pursuing the Ph.D. degree in computer science and technology at Sichuan University, Chengdu, China. His research interests include the sample efficiency and algorithm performance improvement of model-free deep reinforcement learning, and the fundamental study and application of multi-agent reinforcement learning.
Jianwei Zhang received his Ph.D. degree from Sichuan University, Chengdu, China, in 2008. He has taught and conducted research at Sichuan University since 1993. He has published more than 50 articles. His research interests include air traffic management, and intelligent image analysis and processing. Dr. Zhang received the National Science and Technology Progress Award in China.
Zhuang Wang received the B.S. degree in information engineering and the M.S. degree in optical engineering from Tianjin University, Tianjin, China, in 2009 and 2012, respectively. He is currently pursuing the Ph.D. degree in software engineering at Sichuan University, Chengdu, Sichuan, China. His research interests include artificial intelligence in military applications, deep reinforcement learning, and air combat theory.
Yi Lin received his Ph.D. degree from Sichuan University, Chengdu, China, in 2019. He currently works as an Associate Professor with the College of Computer Science, Sichuan University. He was a visiting scholar at University of Wisconsin-Madison, Madison, WI, USA. His research interests include air traffic flow management and planning, machine learning, and deep-learning-based air traffic management applications.
Hui Li received the B.S. degree in computer science from Chengdu University of Science & Technology, Chengdu, China, in 1991, the M.S. degree from Simon Fraser University, Canada, in 1997, and the Ph.D. degree in computer science from Sichuan University, China, in 2007. From 1991 to 1994, he was a software engineer at Sichuan University; from 1997 to 1998, he was a senior developer at Nortel, Ottawa, Canada; since 1999, he has been working in the College of Computer Science, successively as a lecturer, associate professor, and professor. His research interests include virtual reality, command and control simulation, and artificial intelligence. He has been awarded National Natural Science Foundation and National Science & Technology Foundation grants three times. He has also conducted many simulation and smart-system application projects in his domain. He has published more than 20 papers. Dr. Li was awarded the National Science & Technology Advancing Prize once. He is also a member of the China Society of Image & Graphics.