Abstract
Learning to communicate while simultaneously learning a behaviour policy is a challenging problem in multi-agent reinforcement learning. In this work, we combine the MACC (Multi-Agent Counterfactual Communication) method with the DR.PG (Difference Reward Policy Gradient) method and propose the novel DR.MACC (Difference Reward Multi-Agent Counterfactual Communication) method. DR.MACC constructs an agent-specific difference return for both the action policy and the communication policy of each agent. This policy-specific difference return reduces the credit-assignment problem compared to using the team reward directly. Unlike MACC, DR.MACC does not require learning a joint Q-function; instead, it operates directly on the environment's reward function. When the reward function is unavailable, we can instead learn an approximation of it, which yields the DRR.MACC method: the agents' environment interactions are used to train this approximation with supervised learning. In our experiments, we compare the novel DR.MACC method against the MACC method with an individual Q-function and with a joint Q-function. The results show that DR.MACC can outperform both MACC variants across the different environment configurations.
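For context, the following is a minimal sketch of the generic per-agent difference reward on which the DR.PG method, and hence DR.MACC, builds, together with the corresponding policy-gradient estimate. The notation (D_i, o_i, pi_i, r) is illustrative and assumed here, not taken from the paper itself; DR.MACC applies this idea separately to the action and communication policies, and DRR.MACC replaces r with a learned approximation.

% Illustrative sketch only: the symbols below are assumptions, not the
% paper's own notation. D_i subtracts a counterfactual baseline obtained
% by marginalising agent i's action under its own policy.
\[
  D_i(s,\mathbf{a}) \;=\; r(s,\mathbf{a})
    \;-\; \mathbb{E}_{b \sim \pi_i(\cdot \mid o_i)}
          \bigl[\, r\bigl(s,(\mathbf{a}_{-i}, b)\bigr) \,\bigr]
\]
\[
  \nabla_{\theta_i} J \;\approx\;
  \mathbb{E}\Bigl[\, \sum_{t} \nabla_{\theta_i}
     \log \pi_i(a_{i,t} \mid o_{i,t})
     \sum_{t' \ge t} \gamma^{\,t'-t}\, D_i(s_{t'}, \mathbf{a}_{t'}) \,\Bigr]
\]

Because the baseline only changes agent i's own action, the difference return isolates that agent's contribution to the team reward without requiring a learned joint Q-function, provided the reward function r can be queried counterfactually.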
Acknowledgments
Simon Vanneste and Astrid Vanneste are supported by the Research Foundation Flanders (FWO) under Grant Numbers 1S94120N and 1S12121N, respectively.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Vanneste, S., Vanneste, A., De Schepper, T., Mercelis, S., Hellinckx, P., Mets, K. (2025). Multi-Agent Counterfactual Communication Using Difference Rewards Policy Gradients. In: Oliehoek, F.A., Kok, M., Verwer, S. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2023. Communications in Computer and Information Science, vol 2187. Springer, Cham. https://doi.org/10.1007/978-3-031-74650-5_5
DOI: https://doi.org/10.1007/978-3-031-74650-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-74649-9
Online ISBN: 978-3-031-74650-5