Abstract
Deep reinforcement learning (DRL) has recently been employed in various games and has achieved superhuman performance in Atari, Go, and no-limit Texas hold’em. However, this technique has not been fully explored for Doudizhu, a popular card game in Asia that involves both confrontation and cooperation among multiple players under imperfect information. In this paper, we present NV-Dou, a new deep reinforcement learning approach for Doudizhu. It adopts a variant of neural fictitious self-play to approximate the Nash equilibria of the game. The loss functions of its neural network integrate a Q-based policy gradient (mean actor-critic) with advantage learning and proximal policy optimization. In addition, parametric noise is applied to the fully connected layers of the network. Experimental results show that NV-Dou requires only a few hours of training and achieves nearly state-of-the-art performance compared with the well-known open implementations for Doudizhu such as RHCP, CQL, and MCTS.
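One of the components named in the abstract, parametric noise on the fully connected layers, can be illustrated with a minimal sketch of a factorized-Gaussian noisy linear layer in the style of NoisyNets (Fortunato et al. 2018). This is not the paper's actual implementation; the class name, initialization constants, and use of NumPy are illustrative assumptions.

```python
import numpy as np

def factorized_noise(size, rng):
    # f(x) = sign(x) * sqrt(|x|), applied to standard Gaussian samples,
    # as in the factorized variant of NoisyNets
    x = rng.standard_normal(size)
    return np.sign(x) * np.sqrt(np.abs(x))

class NoisyLinear:
    """Fully connected layer with factorized Gaussian parametric noise.

    Exploration comes from resampling the noise on each forward pass
    instead of from an epsilon-greedy policy. Hypothetical sketch only.
    """

    def __init__(self, in_dim, out_dim, sigma0=0.5, seed=0):
        self.rng = np.random.default_rng(seed)
        bound = 1.0 / np.sqrt(in_dim)
        # learnable mean parameters
        self.w_mu = self.rng.uniform(-bound, bound, (out_dim, in_dim))
        self.b_mu = self.rng.uniform(-bound, bound, out_dim)
        # learnable per-weight noise scales
        self.w_sigma = np.full((out_dim, in_dim), sigma0 * bound)
        self.b_sigma = np.full(out_dim, sigma0 * bound)
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, x, noisy=True):
        if not noisy:  # deterministic evaluation: use the means only
            return self.w_mu @ x + self.b_mu
        # factorized noise: one vector per input side, one per output side
        eps_in = factorized_noise(self.in_dim, self.rng)
        eps_out = factorized_noise(self.out_dim, self.rng)
        w = self.w_mu + self.w_sigma * np.outer(eps_out, eps_in)
        b = self.b_mu + self.b_sigma * eps_out
        return w @ x + b
```

In training, `forward(x, noisy=True)` would be called so each pass perturbs the weights, while evaluation can use `noisy=False` for a deterministic policy.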
Acknowledgements
We sincerely thank all the anonymous reviewers, whose comments greatly improved this work. This work is supported by the National Natural Science Foundation of China under Grants 61976065 and U1836205, the Guizhou Science and Technology Foundation under Grant Qiankehejichu[2020]1Y420, and the Guizhou Science Support Project (No. 2022-259).
Appendix
Cite this article
Yu, X., Wang, Y., Qin, J. et al. A Q-based policy gradient optimization approach for Doudizhu. Appl Intell 53, 15372–15389 (2023). https://doi.org/10.1007/s10489-022-04281-x