Abstract
We present new multiagent learning (MAL) algorithms with the general philosophy of ensuring policy convergence against some classes of opponents while otherwise guaranteeing high payoffs. We consider a three-class breakdown of opponent types: (eventually) stationary, self-play, and “other” (see Definition 4) agents. We start with ReDVaLeR, which achieves policy convergence against the first two types and no-regret payoffs against the third, but needs to know the type of each opponent. This serves as a baseline that delineates the difficulty of achieving these goals. We then show that a simple modification of ReDVaLeR yields a new algorithm, RVσ(t), that simultaneously achieves no-regret payoffs in all games and convergence to Nash equilibria in self-play (and hence, as a corollary of no-regret, convergence to best response against eventually stationary opponents) without knowing the opponent types, albeit in a smaller class of games than ReDVaLeR. RVσ(t) thus ensures the performance of a learner during the process of learning, as opposed to the performance of a learned behavior. We show that the regret expression of RVσ(t) can take a slightly better form than those of comparable algorithms such as GIGA and GIGA-WoLF, though, in contrast to theirs, our analysis is in continuous time. Moreover, experiments show that RVσ(t) can converge to an equilibrium in some cases where GIGA and GIGA-WoLF fail to converge, and to better equilibria in coordination games where GIGA and GIGA-WoLF converge to undesirable equilibria. This important class of coordination games also highlights why policy convergence, rather than high average payoff, is the key criterion for MAL in self-play. To our knowledge, this is also the first guaranteed policy convergence of a no-regret algorithm in the Shapley game.
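To make the no-regret criterion in the abstract concrete, the following is a minimal sketch — not the authors' RVσ(t), but plain GIGA-style projected gradient ascent (Zinkevich, 2003) run in self-play on the zero-sum matching pennies game. It tracks external regret against the best fixed action in hindsight; the step size η_t = 1/√t, the starting strategies, and all function names are illustrative assumptions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def giga_selfplay(A, x0, y0, T=5000):
    """GIGA-style projected gradient ascent in self-play on a zero-sum
    matrix game: row player gets x'Ay, column player gets -x'Ay."""
    x, y = np.asarray(x0, float), np.asarray(y0, float)
    cum_payoff = 0.0
    cum_grad = np.zeros(len(x))  # cumulative payoff of each pure row action
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)
        gx, gy = A @ y, -A.T @ x          # each player's payoff gradient
        cum_payoff += x @ A @ y
        cum_grad += gx
        x = project_simplex(x + eta * gx)
        y = project_simplex(y + eta * gy)
    # external regret: best fixed row action in hindsight minus actual payoff
    regret = cum_grad.max() - cum_payoff
    return x, y, regret / T

matching_pennies = np.array([[1.0, -1.0], [-1.0, 1.0]])
x, y, avg_regret = giga_selfplay(matching_pennies, [0.8, 0.2], [0.3, 0.7])
```

As the abstract notes, vanishing average regret does not by itself give policy convergence: in games like matching pennies, gradient-ascent strategies can keep cycling around the mixed equilibrium even while average regret goes to zero, which is exactly the gap RVσ(t) is designed to close in self-play.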
References
Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the thirty-sixth annual symposium on foundations of computer science (pp. 322–331). Milwaukee, WI: IEEE Computer Society Press.
Banerjee, B., & Peng, J. (2004). Performance bounded reinforcement learning in strategic interactions. In Proceedings of the nineteenth national conference on artificial intelligence (AAAI-04) (pp. 2–7). San Jose, CA: AAAI Press.
Bowling, M. (2005). Convergence and no-regret in multiagent learning. In Proceedings of NIPS 2004/5.
Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. In Proceedings of the seventeenth international joint conference on artificial intelligence (pp. 1021–1026). Seattle, WA.
Bowling, M., & Veloso, M. (2002). Multiagent learning using a variable learning rate. Artificial Intelligence, 136, 215–250.
Brafman, R. I., & Tennenholtz, M. (2002). R-max: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213–231.
Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the fifteenth national conference on artificial intelligence (pp. 746–752). Menlo Park, CA: AAAI Press/MIT Press.
Conitzer, V., & Sandholm, T. (2003). AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. In Proceedings of the twentieth international conference on machine learning.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. Wiley.
Flaxman, A., Kalai, A., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: Gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on discrete algorithms (SODA). (To appear.)
Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29, 79–103.
Fudenberg, D., & Levine, D. K. (1995). Consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19, 1065–1089.
Fudenberg, D., & Levine, D. K. (1998). The theory of learning in games. Cambridge, MA: MIT Press.
Greenwald, A., & Hall, K. (2002). Correlated Q-learning. In Proceedings of the AAAI symposium on collaborative learning agents.
Hart, S., & Mas-Colell, A. (2003). Uncoupled dynamics do not lead to Nash equilibrium. American Economic Review, 93(5), 1830–1836.
Hu, J., & Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4, 1039–1069.
Jafari, A., Greenwald, A., Gondek, D., & Ercal, G. (2001). On no-regret learning, fictitious play, and Nash equilibrium. In Proceedings of the eighteenth international conference on machine learning (pp. 226–233).
Littlestone, N., & Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.
Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the eleventh international conference on machine learning (pp. 157–163). San Mateo, CA: Morgan Kaufmann.
Littman, M. L. (2001). Friend-or-foe Q-learning in general-sum games. In Proceedings of the eighteenth international conference on machine learning, Williams College, MA, USA.
Littman, M. L., & Szepesvari, C. (1996). A generalized reinforcement learning model: Convergence and applications. In Proceedings of the thirteenth international conference on machine learning (pp. 310–318).
Nash, J. F. (1951). Non-cooperative games. Annals of Mathematics, 54, 286–295.
Nowak, M., & Sigmund, K. (1993). A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game. Nature, 364, 56–58.
Owen, G. (1995). Game theory. UK: Academic Press.
Posch, M., & Brannath, W. (1997). Win-stay, lose-shift: A general learning rule for repeated normal form games. In Proceedings of the third international conference on computing in economics and finance, Stanford, CA, June 30–July 2, 1997.
Powers, R., & Shoham, Y. (2005). New criteria and a new algorithm for learning in multi-agent systems. In Proceedings of NIPS 2004/5.
Sandholm, T., & Crites, R. (1996). On multiagent Q-learning in a semi-competitive domain. In G. Weiß & S. Sen (Eds.), Adaptation and learning in multi-agent systems (pp. 191–205). Springer-Verlag.
Sen, S., Sekaran, M., & Hale, J. (1994). Learning to coordinate without sharing information. In Proceedings of the national conference on artificial intelligence (pp. 426–431). Menlo Park, CA: AAAI Press/MIT Press. (Also reprinted in M. N. Huhns & M. P. Singh (Eds.), Readings in agents (pp. 509–514). San Francisco, CA: Morgan Kaufmann, 1998.)
Shapley, L. S. (1974). A note on the Lemke–Howson algorithm. Mathematical Programming Study 1: Pivoting and Extensions (pp. 175–189).
Singh, S., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In Proceedings of the sixteenth conference on uncertainty in artificial intelligence (pp. 541–548).
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning (pp. 330–337).
Tesauro, G. (2004). Extending Q-learning to general adaptive multi-agent systems. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems (Vol. 16). Cambridge, MA: MIT Press.
Wang, X., & Sandholm, T. (2002). Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in neural information processing systems 15 (NIPS).
Weinberg, M., & Rosenschein, J. S. (2004). Best-response multiagent learning in non-stationary environments. In Proceedings of the third international joint conference on autonomous agents and multiagent systems (AAMAS) (Vol. 2, pp. 506–513). New York, NY: ACM.
Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the twentieth international conference on machine learning, Washington, DC.
Cite this article
Banerjee, B., Peng, J. Generalized multiagent learning with performance bound. Auton Agent Multi-Agent Syst 15, 281–312 (2007). https://doi.org/10.1007/s10458-007-9013-x