Generalized multiagent learning with performance bound

Abstract

We present new multiagent learning (MAL) algorithms with the general philosophy of achieving policy convergence against certain classes of opponents while otherwise ensuring high payoffs. We consider a three-class breakdown of opponent types: (eventually) stationary, self-play, and “other” (see Definition 4) agents. We start with ReDVaLeR, which can achieve policy convergence against the first two types and no-regret payoffs against the third, but needs to know the type of each opponent. This serves as a baseline to delineate the difficulty of achieving these goals. We show that a simple modification of ReDVaLeR yields a new algorithm, RV σ(t), that simultaneously achieves no-regret payoffs in all games, convergence to Nash equilibria in self-play, and convergence to best response against eventually stationary opponents (a corollary of no-regret), without knowing the opponent types, but in a smaller class of games than ReDVaLeR. RV σ(t) thus effectively ensures the performance of the learner during the process of learning, as opposed to the performance of a learned behavior. We show that the regret bound of RV σ(t) can take a slightly better form than those of comparable algorithms such as GIGA and GIGA-WoLF, although, in contrast to theirs, our analysis is in continuous time. Moreover, experiments show that RV σ(t) can converge to an equilibrium in some cases where GIGA and GIGA-WoLF fail to converge, and to better equilibria in coordination games where GIGA and GIGA-WoLF converge to undesirable equilibria. This important class of coordination games also highlights the desirability of policy convergence, rather than high average payoffs, as a criterion for MAL in self-play. To our knowledge, this is also the first successful (guaranteed) attempt at policy convergence of a no-regret algorithm in the Shapley game.
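For context on the abstract's claims: a learner's (external) regret after T steps is the gap between its cumulative payoff and the cumulative payoff of the best fixed action in hindsight, and "no-regret" means this gap grows sublinearly in T. The sketch below is not the paper's RV σ(t); it is a minimal, illustrative implementation of the standard GIGA update (projected gradient ascent on the strategy simplex, Zinkevich 2003) run in self-play on a common 3x3 parameterization of the Shapley game, the setting where the abstract notes that policy convergence of no-regret learners is nontrivial. The payoff matrices and step-size schedule are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

# An assumed, commonly used 3x3 form of the Shapley game:
# each player receives 1 on its "winning" joint actions, 0 otherwise.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)   # row player's payoffs
B = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)   # column player's payoffs

x = np.ones(3) / 3.0   # row player's mixed strategy
y = np.ones(3) / 3.0   # column player's mixed strategy

for t in range(1, 20001):
    eta = 1.0 / np.sqrt(t)               # GIGA's decaying step size
    gx = A @ y                            # gradient of row player's expected payoff w.r.t. x
    gy = B.T @ x                          # gradient of column player's expected payoff w.r.t. y
    x = project_simplex(x + eta * gx)     # projected gradient ascent step
    y = project_simplex(y + eta * gy)
    if t % 5000 == 0:
        print(t, np.round(x, 3), np.round(y, 3))
```

Printing the mixed strategies periodically lets one inspect whether the policies settle or keep cycling under this assumed setup, which is the distinction between payoff-level (no-regret) and policy-level convergence that the abstract emphasizes.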

Author information

Corresponding author

Correspondence to Bikramjit Banerjee.

About this article

Cite this article

Banerjee, B., Peng, J. Generalized multiagent learning with performance bound. Auton Agent Multi-Agent Syst 15, 281–312 (2007). https://doi.org/10.1007/s10458-007-9013-x
