Abstract
This paper considers a general two-player zero-sum Markov game in which our agent faces opponents of multiple types, each with multiple policies. To improve the agent's return against such diverse opponent policies, a novel Decision-making Framework based on Opponent Distinguishing and Policy Judgment (DF-ODPJ) is proposed. Building on pre-trained Nash equilibrium strategies, DF-ODPJ distinguishes the opponent's type by sampling from the interaction trajectory. A fast criterion is then proposed to judge the opponent's policy; this criterion is proven to minimize the misjudgment probability, and its optimal threshold is calculated. Based on the identification results, appropriate policies are generated to enhance the return. DF-ODPJ is flexible in that it is orthogonal to existing Nash equilibrium algorithms and single-agent reinforcement learning algorithms. Experimental results on grid world, video game, and UAV aerial combat environments illustrate the effectiveness of DF-ODPJ. The code is available at https://github.com/ChenXJ295/DF-ODPJ.
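The policy-judgment step described above can be illustrated with a toy sketch. The paper's actual DF-ODPJ criterion is not reproduced here; this is only a minimal, hypothetical example assuming the opponent's candidate policies are two known distributions over a discrete action set, in which case a Bayesian likelihood-ratio test with the threshold set from the prior odds minimizes the misjudgment probability.

```python
import numpy as np

# Hypothetical setup: two known candidate opponent policies over 4 actions.
# These distributions are illustrative, not from the paper.
pi_0 = np.array([0.7, 0.1, 0.1, 0.1])  # candidate policy 0
pi_1 = np.array([0.1, 0.1, 0.1, 0.7])  # candidate policy 1

def judge_policy(observed_actions, prior_0=0.5):
    """Return 0 or 1: which candidate policy better explains the sample.

    Compares the log-likelihood ratio of the observed actions against
    log((1 - prior_0) / prior_0); for equal priors the threshold is 0,
    which is the Bayes-optimal (misjudgment-minimizing) test.
    """
    obs = np.asarray(observed_actions)
    llr = np.sum(np.log(pi_0[obs]) - np.log(pi_1[obs]))
    threshold = np.log((1 - prior_0) / prior_0)
    return 0 if llr > threshold else 1

# Sample a short trajectory from policy 1 and judge it.
rng = np.random.default_rng(0)
sample = rng.choice(4, size=20, p=pi_1)
print(judge_policy(sample))
```

As the sample length grows, the misjudgment probability of such a test decays exponentially, which is the sense in which a short interaction trajectory can suffice for a fast judgment.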














Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgements
This work was supported by the Anhui Provincial Natural Science Foundation under Grant 2008085MF198.
Author information
Authors and Affiliations
Contributions
Jin Zhu: conceptualization, methodology, investigation and writing; Xuan Wang: data curation, software, validation; Geir E Dullerud: conceptualization, supervision.
Corresponding author
Ethics declarations
Conflicts of Interest
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhu, J., Wang, X. & Dullerud, G.E. Enhanced decision framework for two-player zero-sum Markov games with diverse opponent policies. Appl Intell 55, 449 (2025). https://doi.org/10.1007/s10489-025-06344-1