Coordinated learning in multiagent MDPs with infinite state-space

Autonomous Agents and Multi-Agent Systems

Abstract

In this paper we address the problem of simultaneous learning and coordination in multiagent Markov decision problems (MMDPs) with infinite state-spaces. We separate this problem into two distinct subproblems: learning and coordination. To tackle the problem of learning, we use Q-learning with soft-state aggregation (Q-SSA), a well-known method from the reinforcement learning literature (Singh et al., 1994). Q-SSA allows the agents in the game to approximate the optimal Q-function, from which the optimal policies can be computed. We establish the convergence of Q-SSA and introduce a new result describing the rate of convergence of this method. In tackling the problem of coordination, we start by pointing out that knowledge of the optimal Q-function is not enough to ensure that all agents adopt a jointly optimal policy. We propose a novel coordination mechanism that, given knowledge of the optimal Q-function for an MMDP, ensures that all agents converge to a jointly optimal policy in every relevant state of the game. This coordination mechanism, approximate biased adaptive play (ABAP), extends biased adaptive play (Wang & Sandholm, 2003) to MMDPs with infinite state-spaces. Finally, we combine Q-SSA with ABAP, leading to a novel algorithm in which learning of the game and coordination take place simultaneously. We discuss several important properties of this new algorithm and establish its convergence with probability 1. We also provide simple illustrative examples of application.
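
As a rough illustration of the learning component described in the abstract, the sketch below shows Q-learning with soft-state aggregation in the spirit of Singh et al. (1994): the Q-function over a continuous state-space is represented as a soft mixture over a small set of abstract states, and each temporal-difference correction is spread over those abstract states according to the aggregation probabilities. The Gaussian feature map, the toy environment and all parameter values are illustrative assumptions, not the construction analyzed in the paper.

    import numpy as np

    def soft_features(x, centers, width=0.2):
        # Soft-aggregation probabilities phi_i(x): one weight per abstract state
        # (illustrative Gaussian similarities, normalized to sum to 1).
        w = np.exp(-0.5 * ((x - centers) / width) ** 2)
        return w / w.sum()

    def q_ssa_update(theta, x, a, r, x_next, centers, gamma=0.95, alpha=0.1):
        # One Q-SSA step: Q(x, a) is represented as sum_i phi_i(x) * theta[i, a].
        phi = soft_features(x, centers)
        phi_next = soft_features(x_next, centers)
        td = r + gamma * (phi_next @ theta).max() - phi @ theta[:, a]
        theta[:, a] += alpha * td * phi  # spread the correction over abstract states
        return theta

    # Toy usage: continuous state in [0, 1], two actions, purely random exploration.
    rng = np.random.default_rng(0)
    centers = np.linspace(0.0, 1.0, 10)   # 10 abstract states
    theta = np.zeros((centers.size, 2))   # one parameter per (abstract state, action)
    x = rng.random()
    for _ in range(5000):
        a = int(rng.integers(2))
        x_next = rng.random()
        r = 1.0 if (a == 1 and x > 0.5) else 0.0
        theta = q_ssa_update(theta, x, a, r, x_next, centers)
        x = x_next

In this representation, a greedy policy at a state x is recovered by evaluating phi(x) @ theta and selecting a maximizing action; the paper's coordination mechanism (ABAP) addresses the separate problem of making all agents settle on the same jointly optimal policy, which knowledge of the optimal Q-function alone does not guarantee.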


References

  1. Bernstein D. S., Zilberstein S., Immerman N. (2002) The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4): 819–840

  2. Bertsekas D. P., Tsitsiklis J. N. (1996) Neuro-dynamic programming. Optimization and neural computation series. Athena Scientific, Belmont, MA

  3. Boutilier, C. (1999). Sequential optimality and coordination in multiagent systems. In Proceedings of the 16th international joint conference on artificial intelligence (IJCAI’99) (pp. 478–485).

  4. Boutilier, C. (1996). Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th conference on theoretical aspects of rationality and knowledge (TARK-96) (pp. 195–210)

  5. Bowling, M. (2000). Convergence problems of general-sum multiagent reinforcement learning. In Proceedings of the 17th international conference on machine learning (ICML’00) (pp. 89–94). Morgan Kaufmann.

  6. Bowling, M., & Veloso, M. (2000a). An analysis of stochastic game theory for multiagent reinforcement learning. Technical Report CMU-CS-00-165, School of Computer Science, Carnegie Mellon University.

  7. Bowling, M., & Veloso, M. (2000b). Scalable learning in stochastic games. In Proceedings of the AAAI workshop on game theoretic and decision theoretic agents (GTDT’02) (pp. 11–18). The AAAI Press, Published as AAAI Technical Report WS-02-06.

  8. Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. In Proceedings of the 17th international joint conference on artificial intelligence (IJCAI’01) (pp. 1021–1026).

  9. Bowling M., Veloso M. (2002) Multi-agent learning using a variable learning rate. Artificial Intelligence 136: 215–250

  10. Brown G. W. (1949) Some notes on computation of games solutions. Research Memoranda RM-125-PR. RAND Corporation, Santa Monica

  11. Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th national conference on artificial intelligence (AAAI’98) (pp. 746–752).

  12. Crites R. H., Barto A. G. (1998) Elevator group control using multiple reinforcement learning agents. Machine Learning 33(2–3): 235–262

  13. Duflo, M. (1997). Random iterative models. Applications of Mathematics (Vol. 34). Springer.

  14. Durfee E. H., Lesser V. R., Corkill D. D. (1987) Coherent cooperation among communicating problem solvers. IEEE Transactions on Computers 36(11): 1275–1291

  15. Even-Dar E., Mansour Y. (2003) Learning rates for Q-learning. Journal of Machine Learning Research 5: 1–25

  16. Gmytrasiewicz P., Doshi P. (2005) A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research 24: 49–79

  17. Gordon, G. J. (1995). Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, School of Computer Science, Carnegie Mellon University.

  18. Guestrin, C., Lagoudakis, M. G., & Parr, R. (2002). Coordinated reinforcement learning. In Proceedings of the 19th international conference on machine learning (ICML’02) (pp. 227–234).

  19. Hu J., Wellman M. P. (2003) Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research 4: 1039–1069

  20. Kearns, M., & Singh, S. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In M. J. Kearns, S. A. Solla, & D. A. Cohn, (Eds.), Advances in neural information processing systems (Vol. 11, pp. 996–1002). Cambridge, MA: MIT Press.

  21. Kok, J. R., Spaan, M. T. J., & Vlassis, N. (2002). An approach to noncommunicative multiagent coordination in continuous domains. In: M. Wiering, (Ed.), Benelearn 2002: Proceedings of the 12th Belgian–Dutch conference on machine learning (pp. 46–52). Utrecht, The Netherlands.

  22. Leslie D. S., Collins E. J. (2006) Generalised weakened fictitious play. Games and Economic Behavior 56: 285–298

  23. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In R. López de Mántaras, & D. Poole (Eds.), Proceedings of the 11th international conference on machine learning (ICML’94) (pp. 157–163). San Francisco, CA: Morgan Kaufmann.

  24. Littman M. L. (2001) Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research 2(1): 55–66

  25. Littman, M. L. (2001b). Friend-or-foe Q-learning in general-sum games. In Proceedings of the 18th international conference on machine learning (ICML’01) (pp. 322–328). San Francisco, CA: Morgan Kaufmann.

  26. Melo, F. S., & Ribeiro, M. I. (2007a). Rational and convergent model-free adaptive learning for team Markov games. Technical Report RT-601-07, Institute for Systems and Robotics, February.

  27. Melo, F. S., & Ribeiro, M. I. (2007b). Learning to coordinate in topological navigation tasks. In Proceedings of the 6th IFAC symposium on intelligent autonomous vehicles (IAV’07) (to appear), September.

  28. Melo, F. S., & Ribeiro, M. I. (2008). Emerging coordination in infinite team Markov games. In Proceedings of the 7th international conference on autonomous agents and multiagent systems (AAMAS’08) (pp. 355–362).

  29. Melo, F. S., & Veloso, M. (2009). Learning of coordination: Exploiting sparse interactions in multiagent systems. In Proceedings of the 8th international conference on autonomous agents and multiagent systems (AAMAS’09) (pp. 773–780).

  30. Melo, F. S., Meyn, S. P., & Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on machine learning (ICML’08) (pp. 664–671).

  31. Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. Communications and Control Engineering Series. New York: Springer.

  32. Nash J. F. (1950) Equilibrium points in n-person games. Proceedings of the National Academy of Sciences 36: 48–49

  33. Ormoneit D., Sen Ś. (2002) Kernel-based reinforcement learning. Machine Learning 49: 161–178

  34. Pelletier M. (1998) On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic Processes and their Applications 78: 217–244

  35. Perkins T. J., Precup D. (2003) A convergent form of approximate policy iteration. In: Thrun S., Becker S., Obermayer K. (eds) Advances in neural information processing systems. MIT Press, Cambridge, MA, pp 1595–1602

  36. Robinson J. (1951) An iterative method of solving a game. Annals of Mathematics 54: 296–301

  37. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. Reprinted in IBM Journal of Research and Development, 44(1/2), 206–226, 2000.

  38. Samuel A. L. (1967) Some studies in machine learning using the game of checkers II: Recent progress. IBM Journal of Research and Development 11: 601–617

  39. Sen S., Weiß G. (1999) Learning in multiagent systems, chapter 6. MIT Press, Cambridge, MA, pp 259–298

  40. Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Reinforcement learning with soft state aggregation. In Advances in neural information processing systems (Vol. 7, pp. 361–368). Cambridge, MA: MIT Press.

  41. Singh, S. P., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In Proceedings of the 16th conference on uncertainty in artificial intelligence (UAI’00) (pp. 541–548).

  42. Sutton R. S., Barto A. G. (1998) Reinforcement learning: An introduction. Adaptive computation and machine learning series (3rd ed.). MIT Press, Cambridge, MA

  43. Szepesvári C. (1997) The asymptotic convergence rates for Q-learning. Proceedings of Neural Information Processing Systems (NIPS’97) 10: 1064–1070

  44. Szepesvári C., Littman M. L. (1999) A unified analysis of value-function-based reinforcement learning algorithms. Neural Computation 11(8): 2017–2059

  45. Szepesvári, C., & Smart, W. D. (2004). Interpolation-based Q-learning. In Proceedings of the 21st international conference on machine learning (ICML’04) (pp. 100–107). New York, USA: ACM Press, July.

  46. Tesauro G. (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2): 215–219

  47. Tesauro G. (1995) Temporal difference learning and TD-Gammon. Communications of the ACM 38(3): 58–68

  48. Tong H., Brown T. X. (2000) Reinforcement learning for call admission control and routing under quality of service constraints in multimedia networks. Machine Learning 49(2–3): 111–139

  49. Tsitsiklis J. N., Athans M. (1985) On the complexity of decentralized decision making and detection problems. IEEE Transactions on Automatic Control AC-30(5): 440–446

  50. Tsitsiklis J. N., Van Roy B. (1996) Feature-based methods for large scale dynamic programming. Machine Learning 22: 59–94

  51. Tsitsiklis J. N., Van Roy B. (1996) An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5): 674–690

  52. Uther, W., & Veloso, M. (2003). Adversarial reinforcement learning. Technical Report CMU-CS-03-107, School of Computer Science, Carnegie Mellon University, January.

  53. Wang X., Sandholm T. (2003) Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In: Becker S., Thrun S., Obermayer K. (eds) Advances in neural information processing systems. MIT Press, Cambridge, MA, pp 1571–1578

  54. Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, May.

  55. Young H. P. (1993) The evolution of conventions. Econometrica 61(1): 57–84

Author information

Corresponding author

Correspondence to Francisco S. Melo.


About this article

Cite this article

Melo, F.S., Ribeiro, M.I. Coordinated learning in multiagent MDPs with infinite state-space. Auton Agent Multi-Agent Syst 21, 321–367 (2010). https://doi.org/10.1007/s10458-009-9104-y

