Abstract
Deep reinforcement learning has achieved superhuman performance in zero-sum games such as Go and Poker in recent years. Many real-world scenarios, however, are non-zero-sum settings, where success requires cooperation and communication rather than competition. The game of Hanabi has been established as an ideal benchmark for agents learning to cooperate with other agents and with humans. Bayesian action decoder methods perform well in the two-player Hanabi game, but a large gap remains between their scores and those of hat-coding strategies in the three- to five-player settings. The pivotal problem is the tension between exploring actions and exploiting the information carried by observed actions. We present a novel deep multi-agent reinforcement learning method, the Modified Action Decoder, which resolves this problem by leveraging the centralized-training-with-decentralized-execution paradigm. During the training phase, agents observe not only the exploratory action a teammate selects but also the teammate's optimal action, enabling better exploitation. We evaluate our method on Hanabi with two to five players; it outperforms previously published reinforcement learning methods and establishes a new state of the art.
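The core idea in the abstract, letting teammates observe both the exploratory action and the optimal action during centralized training, can be illustrated with a minimal sketch. This is not the paper's implementation; `select_actions` and the example Q-values are hypothetical, and the sketch only shows an epsilon-greedy selector that exposes the greedy action alongside the executed one, so that exploration noise does not corrupt the information teammates decode from actions.

```python
import random

def select_actions(q_values, epsilon):
    """Epsilon-greedy selection that exposes both actions.

    Returns (exploratory_action, greedy_action). The agent executes the
    exploratory action in the environment; during centralized training,
    teammates additionally observe the greedy action, so their belief
    updates are not distorted by random exploration.
    """
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    if random.random() < epsilon:
        exploratory = random.randrange(len(q_values))
    else:
        exploratory = greedy
    return exploratory, greedy

# Hypothetical Q-values over 4 actions; with epsilon = 0 both
# returned actions coincide with the argmax.
explore_a, greedy_a = select_actions([0.1, 0.9, 0.3, 0.2], epsilon=0.0)
```

At execution time only the exploratory action is visible, so the learned policy remains decentralized while training exploits the extra signal.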










Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61976216 and 61672522).
Cite this article
Du, W., Ding, S., Zhang, C. et al. Modified action decoder using Bayesian reasoning for multi-agent deep reinforcement learning. Int. J. Mach. Learn. & Cyber. 12, 2947–2961 (2021). https://doi.org/10.1007/s13042-021-01385-7