Abstract
Deep reinforcement learning (DRL) has recently been employed in various games and has achieved superhuman performance in Atari, Go, and no-limit Texas hold’em. However, this technique has not been fully explored for Doudizhu, a popular card game in Asia that involves both confrontation and cooperation among multiple players under imperfect information. In this paper, we present NV-Dou, a new deep reinforcement learning approach for Doudizhu. It adopts a variant of neural fictitious self-play to approximate the Nash equilibria of the game. The loss functions of its neural network integrate a Q-based policy gradient (mean actor-critic) with advantage learning and proximal policy optimization. In addition, parametric noise is applied to the fully connected layers of the network. Experimental results show that NV-Dou requires only a few hours of training and achieves nearly state-of-the-art performance compared with the well-known open implementations for Doudizhu such as RHCP, CQL, and MCTS.
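One of the components named in the abstract, parametric noise on the fully connected layers, can be illustrated with a minimal sketch of a factorized-Gaussian noisy linear layer in the style of NoisyNets (Fortunato et al. 2018). This is not the paper's actual implementation; the class name, initialization constants, and use of NumPy are illustrative assumptions.

```python
import numpy as np

def factorized_noise(size, rng):
    # f(x) = sign(x) * sqrt(|x|), applied to standard Gaussian samples,
    # as in the factorized variant of NoisyNets
    x = rng.standard_normal(size)
    return np.sign(x) * np.sqrt(np.abs(x))

class NoisyLinear:
    """Fully connected layer with factorized Gaussian parametric noise.

    Exploration comes from resampling the noise on each forward pass
    instead of from an epsilon-greedy policy. Hypothetical sketch only.
    """

    def __init__(self, in_dim, out_dim, sigma0=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(in_dim)
        # learnable mean parameters
        self.w_mu = self.rng.uniform(-bound, bound, (out_dim, in_dim))
        self.b_mu = self.rng.uniform(-bound, bound, out_dim)
        # learnable per-weight noise scales
        self.w_sigma = np.full((out_dim, in_dim), sigma0 * bound)
        self.b_sigma = np.full(out_dim, sigma0 * bound)
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, x, noisy=True):
        if not noisy:  # deterministic evaluation: use the means only
            return self.w_mu @ x + self.b_mu
        # factorized noise: one vector per input side, one per output side
        eps_in = factorized_noise(self.in_dim, self.rng)
        eps_out = factorized_noise(self.out_dim, self.rng)
        w = self.w_mu + self.w_sigma * np.outer(eps_out, eps_in)
        b = self.b_mu + self.b_sigma * eps_out
        return w @ x + b
```

In training, `forward(x, noisy=True)` would be called so each pass perturbs the weights, while evaluation can use `noisy=False` for a deterministic policy.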
Acknowledgements
We sincerely thank all the anonymous reviewers, whose comments greatly improved this work. This work is supported by the National Natural Science Foundation of China under Grants 61976065 and U1836205, the Guizhou Science and Technology Foundation under Grant Qiankehejichu[2020]1Y420, and the Guizhou Science Support Project (No. 2022-259).
Appendix
Cite this article
Yu, X., Wang, Y., Qin, J. et al. A Q-based policy gradient optimization approach for Doudizhu. Appl Intell 53, 15372–15389 (2023). https://doi.org/10.1007/s10489-022-04281-x