
A Monte Carlo Neural Fictitious Self-Play approach to approximate Nash Equilibrium in imperfect-information dynamic games

  • Research Article
  • Published in: Frontiers of Computer Science

Abstract

Solving an optimization problem to approach a Nash Equilibrium point plays an important role in imperfect-information games, e.g., StarCraft and poker. Neural Fictitious Self-Play (NFSP) is an effective algorithm that learns an approximate Nash Equilibrium of imperfect-information games purely from self-play, without prior domain knowledge. However, it needs to train a neural network in an off-policy manner to approximate the action values. For games with large search spaces, the training may suffer from unnecessary exploration and sometimes fails to converge. In this paper, we propose a new Neural Fictitious Self-Play algorithm, called MC-NFSP, that combines Monte Carlo tree search with NFSP to improve performance in real-time zero-sum imperfect-information games. With experiments and empirical analysis, we demonstrate that the proposed MC-NFSP algorithm can approximate a Nash Equilibrium in games with large search depth, where NFSP cannot. Furthermore, we develop an Asynchronous Neural Fictitious Self-Play framework (ANFSP). It uses an asynchronous, parallel architecture to collect game experience and improves both training efficiency and policy quality. Experiments with games with hidden state information (Texas Hold’em) and with FPS (first-person shooter) games demonstrate the effectiveness of our algorithms.
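To make the setting concrete, below is a minimal illustrative Python sketch (not the authors' implementation) of the NFSP-style agent structure that MC-NFSP builds on: the agent mixes a best-response policy (which MC-NFSP would obtain from Monte Carlo tree search guided by a neural network; here a random stand-in keeps the sketch runnable) with an average policy trained by supervised learning on the agent's own best-response actions, kept in a reservoir buffer. The class and method names, the anticipatory parameter eta, and the tabular average policy are assumptions made for illustration only.

```python
# Sketch of an NFSP-style self-play agent (illustrative, not the paper's code).
import random
from collections import defaultdict, deque

class SketchNFSPAgent:
    def __init__(self, actions, eta=0.1, rl_capacity=10000, sl_capacity=10000):
        self.actions = actions
        self.eta = eta                                # anticipatory parameter: P(play best response)
        self.rl_memory = deque(maxlen=rl_capacity)    # transitions for best-response (RL) training
        self.sl_memory = []                           # reservoir of (state, action) for the average policy
        self.sl_capacity = sl_capacity
        self.sl_seen = 0
        self.avg_policy = defaultdict(lambda: defaultdict(int))  # state -> action counts

    def best_response(self, state):
        # In MC-NFSP this decision would come from Monte Carlo tree search guided
        # by a policy/value network; a random choice is used here as a stand-in.
        return random.choice(self.actions)

    def average_action(self, state):
        # Sample an action in proportion to how often the best response chose it.
        counts = self.avg_policy[state]
        if not counts:
            return random.choice(self.actions)
        total = sum(counts.values())
        r, acc = random.uniform(0, total), 0.0
        for a, c in counts.items():
            acc += c
            if r <= acc:
                return a
        return a

    def act(self, state):
        # Mix the two policies: with probability eta follow the best response and
        # record the chosen action as a supervised-learning target.
        if random.random() < self.eta:
            a = self.best_response(state)
            self._reservoir_add((state, a))
        else:
            a = self.average_action(state)
        return a

    def observe(self, state, action, reward, next_state, done):
        self.rl_memory.append((state, action, reward, next_state, done))

    def _reservoir_add(self, item):
        # Reservoir sampling keeps an unbiased sample of past best-response behaviour.
        self.sl_seen += 1
        if len(self.sl_memory) < self.sl_capacity:
            self.sl_memory.append(item)
        else:
            j = random.randrange(self.sl_seen)
            if j < self.sl_capacity:
                self.sl_memory[j] = item

    def train_average_policy(self):
        # Rebuild the (tabular) average policy from the reservoir; with neural
        # networks this would be a supervised gradient step instead.
        self.avg_policy = defaultdict(lambda: defaultdict(int))
        for state, action in self.sl_memory:
            self.avg_policy[state][action] += 1
```

An asynchronous variant in the spirit of ANFSP would run several such agents in parallel self-play and merge their collected memories before each training step.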



Acknowledgements

This work was supported by the National Key Research and Development Program of China (2017YFB1002503) and the Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project of China (2018AAA0100902).

Author information


Corresponding author

Correspondence to Shijian Li.

Additional information

Li Zhang received the BEng and PhD degrees from Zhejiang University, China in 2007 and 2013, respectively. He is currently an assistant researcher in the Department of Computer Science, Zhejiang University, China. In 2009, he was a visiting scholar at the University of Hong Kong, China. From 2013 to 2017, he was a researcher at Works Applications Co., Ltd. His current interests include deep learning, game theory, human-machine hybrid computing, and pervasive computing. He has authored over ten refereed papers and patents.

Yuxuan Chen is a PhD student of computer science in the Pervasive and Cyborg Technology laboratory of Zhejiang University, China. His research focuses on game algorithms for incomplete-information games and Nash Equilibrium problems. He has done research on reinforcement learning and implemented several related algorithms. He is also interested in human-computer interaction and contributes to open-source gaming algorithms.

Wei Wang is currently an algorithm engineer at Alibaba. She received her MS in computer science from Zhejiang University, China and her BS in computer science from the University of Electronic Science and Technology of China. During her studies she did research on reinforcement learning with her lecturer, and she now works on recommendation systems.

Ziliang Han is currently pursuing a master’s degree at the Pervasive and Cyborg Technology laboratory of Zhejiang University, China. He holds a bachelor’s degree in computer science from Jilin University, China. He has done research on reinforcement learning and implemented several related algorithms, and he is building a computational platform for large-scale game-algorithm training and evaluation. He also contributes to open-source gaming algorithms.

Shijian Li received the PhD degree from Zhejiang University, China in 2006. In 2010, he was a visiting scholar with the Institute Telecom SudParis, France. He is currently with the College of Computer Science and Technology, Zhejiang University, China. He has published over 40 papers. His research interests include sensor networks, ubiquitous computing, and social computing. He serves as an Editor of the International Journal of Distributed Sensor Networks and as a reviewer or PC member of more than ten conferences.

Zhijie Pan, PhD, is a professor and the director of the Intelligent Vehicle Research Center of Zhejiang University, China. He is mainly engaged in cross-disciplinary research between computer science and automotive engineering, including AI, autonomous vehicle technology, automatic driving, and ITS. He has published more than 90 papers at international conferences and in international journals (IEEE, SAE, JSAE, etc.) and holds more than 100 patents.

Gang Pan received the BEng and PhD degrees from Zhejiang University, China in 1998 and 2004, respectively. He is currently a professor in the Department of Computer Science and deputy director of the State Key Lab of CAD&CG, Zhejiang University, China. His current interests include artificial intelligence, pervasive computing, brain-inspired computing, and brain-machine interfaces. He has authored over 100 refereed papers and holds 35 granted patents. He received three best paper awards (e.g., ACM UbiComp’16) and three nominations from premier international conferences. He is the recipient of the IEEE TCSC Award for Excellence (Middle Career Researcher), the CCF-IEEE CS Young Computer Scientist Award, and the State Scientific and Technological Progress Award. He serves as an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems, IEEE Systems Journal, and Pervasive and Mobile Computing.

Electronic supplementary material

11704_2020_9307_MOESM1_ESM.pdf



About this article


Cite this article

Zhang, L., Chen, Y., Wang, W. et al. A Monte Carlo Neural Fictitious Self-Play approach to approximate Nash Equilibrium in imperfect-information dynamic games. Front. Comput. Sci. 15, 155334 (2021). https://doi.org/10.1007/s11704-020-9307-6

