Multi-Agent Counterfactual Communication Using Difference Rewards Policy Gradients

  • Conference paper
  • In: Artificial Intelligence and Machine Learning (BNAIC/Benelearn 2023)

Abstract

Learning to communicate while simultaneously learning a behaviour policy is a challenging problem in the multi-agent reinforcement learning domain. In this work, we combine the MACC (Multi-Agent Counterfactual Communication) method with the DR.PG (Difference Reward Policy Gradient) method and propose the novel DR.MACC (Difference Reward Multi-Agent Counterfactual Communication) method. DR.MACC constructs an agent-specific difference return for both the action policy and the communication policy of each agent. This policy-specific difference return mitigates the credit-assignment problem compared to using the team reward directly. Unlike MACC, the DR.MACC method does not require learning a joint Q-function; instead, it operates directly on the environment's reward function. When the reward function is unavailable, we can instead learn an approximation of it, which yields the DRR.MACC method. Here, the agents' environment interactions are used to train this reward-function approximation with supervised learning. In the experiments, we compare the novel DR.MACC method against the MACC method with an individual Q-function and with a joint Q-function. The results show that DR.MACC can outperform both MACC variants across the different environment configurations.
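To make the abstract's central idea concrete, the sketch below shows one way an agent-specific difference reward can be computed when the environment's reward function is available, in the spirit of difference rewards policy gradients. It is a minimal illustration under assumed interfaces, not the authors' implementation; the function and parameter names (reward_fn, difference_reward, policy_probs, action_space) are hypothetical.

    # Minimal, illustrative sketch of an agent-specific difference reward.
    # NOT the authors' code; `reward_fn(state, joint_action)` is an assumed
    # interface to the environment's reward function.

    from typing import Callable, Sequence


    def difference_reward(
        state: object,
        joint_action: Sequence[int],
        agent_idx: int,
        action_space: Sequence[int],
        policy_probs: Sequence[float],  # pi_i(a | o_i) for each a in action_space
        reward_fn: Callable[[object, Sequence[int]], float],
    ) -> float:
        """Team reward minus a counterfactual baseline obtained by
        marginalising agent `agent_idx`'s own action under its policy,
        while keeping the other agents' actions fixed."""
        team_reward = reward_fn(state, joint_action)

        # Counterfactual baseline: expected team reward if only this agent
        # had resampled its action from its own policy.
        baseline = 0.0
        for action, prob in zip(action_space, policy_probs):
            counterfactual = list(joint_action)
            counterfactual[agent_idx] = action
            baseline += prob * reward_fn(state, counterfactual)

        return team_reward - baseline

Discounting such per-step difference rewards gives a policy-specific difference return that can replace the shared team return in each agent's policy-gradient update. In DR.MACC the same counterfactual idea is applied to the communication policy by varying the sent message rather than the environment action, and in DRR.MACC the calls to the true reward function would be replaced by a regression model trained on observed (state, joint action, reward) transitions.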

Acknowledgments

Simon Vanneste and Astrid Vanneste are supported by the Research Foundation Flanders (FWO) under Grant Number 1S94120N and Grant Number 1S12121N respectively.

Author information

Corresponding author

Correspondence to Simon Vanneste.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Vanneste, S., Vanneste, A., De Schepper, T., Mercelis, S., Hellinckx, P., Mets, K. (2025). Multi-Agent Counterfactual Communication Using Difference Rewards Policy Gradients. In: Oliehoek, F.A., Kok, M., Verwer, S. (eds) Artificial Intelligence and Machine Learning. BNAIC/Benelearn 2023. Communications in Computer and Information Science, vol 2187. Springer, Cham. https://doi.org/10.1007/978-3-031-74650-5_5

  • DOI: https://doi.org/10.1007/978-3-031-74650-5_5

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74649-9

  • Online ISBN: 978-3-031-74650-5

  • eBook Packages: Artificial Intelligence (R0)
