Coordinated learning in multiagent MDPs with infinite state-space

Autonomous Agents and Multi-Agent Systems

Abstract

In this paper we address the problem of simultaneous learning and coordination in multiagent Markov decision problems (MMDPs) with infinite state-spaces. We separate this problem into two distinct subproblems: learning and coordination. To tackle the problem of learning, we use Q-learning with soft-state aggregation (Q-SSA), a well-known method from the reinforcement learning literature (Singh et al., 1994). Q-SSA allows the agents in the game to approximate the optimal Q-function, from which the optimal policies can be computed. We establish the convergence of Q-SSA and introduce a new result describing the rate of convergence of this method. In tackling the problem of coordination, we start by pointing out that knowledge of the optimal Q-function is not enough to ensure that all agents adopt a jointly optimal policy. We propose a novel coordination mechanism that, given knowledge of the optimal Q-function for an MMDP, ensures that all agents converge to a jointly optimal policy in every relevant state of the game. This coordination mechanism, approximate biased adaptive play (ABAP), extends biased adaptive play (Wang & Sandholm, 2003) to MMDPs with infinite state-spaces. Finally, we combine Q-SSA with ABAP, leading to a novel algorithm in which learning of the game and coordination take place simultaneously. We discuss several important properties of this new algorithm and establish its convergence with probability 1. We also provide simple illustrative examples of application.
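
As a rough illustration of the learning component described in the abstract, the sketch below shows Q-learning with soft-state aggregation in the spirit of Singh et al. (1994): the Q-function over a continuous state-space is represented as a soft mixture over a small set of abstract states, and each temporal-difference correction is spread over those abstract states according to the aggregation probabilities. The Gaussian feature map, the toy environment and all parameter values are illustrative assumptions, not the construction analyzed in the paper.

    import numpy as np

    def soft_features(x, centers, width=0.2):
        # Soft-aggregation probabilities phi_i(x): one weight per abstract state
        # (illustrative Gaussian similarities, normalized to sum to 1).
        w = np.exp(-0.5 * ((x - centers) / width) ** 2)
        return w / w.sum()

    def q_ssa_update(theta, x, a, r, x_next, centers, gamma=0.95, alpha=0.1):
        # One Q-SSA step: Q(x, a) is represented as sum_i phi_i(x) * theta[i, a].
        phi = soft_features(x, centers)
        phi_next = soft_features(x_next, centers)
        td = r + gamma * (phi_next @ theta).max() - phi @ theta[:, a]
        theta[:, a] += alpha * td * phi  # spread the correction over abstract states
        return theta

    # Toy usage: continuous state in [0, 1], two actions, purely random exploration.
    rng = np.random.default_rng(0)
    centers = np.linspace(0.0, 1.0, 10)   # 10 abstract states
    theta = np.zeros((centers.size, 2))   # one parameter per (abstract state, action)
    x = rng.random()
    for _ in range(5000):
        a = int(rng.integers(2))
        x_next = rng.random()
        r = 1.0 if (a == 1 and x > 0.5) else 0.0
        theta = q_ssa_update(theta, x, a, r, x_next, centers)
        x = x_next

In this representation, a greedy policy at a state x is recovered by evaluating phi(x) @ theta and selecting a maximizing action; the paper's coordination mechanism (ABAP) addresses the separate problem of making all agents settle on the same jointly optimal policy, which knowledge of the optimal Q-function alone does not guarantee.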


References

  1. Bernstein D. S., Zilberstein S., Immerman N. (2002) The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4): 819–840

  2. Bertsekas D. P., Tsitsiklis J. N. (1996) Neuro-dynamic programming. Optimization and neural computation series. Athena Scientific, Belmont, MA

  3. Boutilier, C. (1999). Sequential optimality and coordination in multiagent systems. In Proceedings of the 16th international joint conference on artificial intelligence (IJCAI’99) (pp. 478–485).

  4. Boutilier, C. (1996). Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th conference on theoretical aspects of rationality and knowledge (TARK-96) (pp. 195–210)

  5. Bowling, M. (2000). Convergence problems of general-sum multiagent reinforcement learning. In Proceedings of the 17th international conference on machine learning (ICML’00) (pp. 89–94). Morgan Kaufmann.

  6. Bowling, M., & Veloso, M. (2000a). An analysis of stochastic game theory for multiagent reinforcement learning. Technical Report CMU-CS-00-165, School of Computer Science, Carnegie Mellon University.

  7. Bowling, M., & Veloso, M. (2000b). Scalable learning in stochastic games. In Proceedings of the AAAI workshop on game theoretic and decision theoretic agents (GTDT’02) (pp. 11–18). The AAAI Press, Published as AAAI Technical Report WS-02-06.

  8. Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. In Proceedings of the 17th international joint conference on artificial intelligence (IJCAI’01) (pp. 1021–1026).

  9. Bowling M., Veloso M. (2002) Multi-agent learning using a variable learning rate. Artificial Intelligence 136: 215–250

  10. Brown G. W. (1949) Some notes on computation of games solutions. Research Memoranda RM-125-PR. RAND Corporation, Santa Monica

  11. Claus, C., & Boutilier, C. (1998). The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the 15th national conference on artificial intelligence (AAAI’98) (pp. 746–752).

  12. Crites R. H., Barto A. G. (1998) Elevator group control using multiple reinforcement learning agents. Machine Learning 33(2–3): 235–262

  13. Duflo, M. (1997). Random iterative models. Applications of Mathematics (Vol. 34). Springer.

  14. Durfee E. H., Lesser V. R., Corkill D. D. (1987) Coherent cooperation among communicating problem solvers. IEEE Transactions on Computers 36(11): 1275–1291

  15. Even-Dar E., Mansour Y. (2003) Learning rates for Q-learning. Journal of Machine Learning Research 5: 1–25

  16. Gmytrasiewicz P., Doshi P. (2005) A framework for sequential planning in multiagent settings. Journal of Artificial Intelligence Research 24: 49–79

  17. Gordon, G. J. (1995). Stable function approximation in dynamic programming. Technical Report CMU-CS-95-103, School of Computer Science, Carnegie Mellon University.

  18. Guestrin, C., Lagoudakis, M. G., & Parr, R. (2002). Coordinated reinforcement learning. In Proceedings of the 19th international conference on machine learning (ICML’02) (pp. 227–234).

  19. Hu J., Wellman M. P. (2003) Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research 4: 1039–1069

  20. Kearns, M., & Singh, S. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In M. J. Kearns, S. A. Solla, & D. A. Cohn, (Eds.), Advances in neural information processing systems (Vol. 11, pp. 996–1002). Cambridge, MA: MIT Press.

  21. Kok, J. R., Spaan, M. T. J., & Vlassis, N. (2002). An approach to noncommunicative multiagent coordination in continuous domains. In: M. Wiering, (Ed.), Benelearn 2002: Proceedings of the 12th Belgian–Dutch conference on machine learning (pp. 46–52). Utrecht, The Netherlands.

  22. Leslie D. S., Collins E. J. (2006) Generalised weakened fictitious play. Games and Economic Behavior 56: 285–298

  23. Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In R. López de Mántaras, & D. Poole (Eds.), Proceedings of the 11th international conference on machine learning (ICML’94) (pp. 157–163). San Francisco, CA: Morgan Kaufmann.

  24. Littman M. L. (2001) Value-function reinforcement learning in Markov games. Journal of Cognitive Systems Research 2(1): 55–66

  25. Littman, M. L. (2001b). Friend-or-foe Q-learning in general-sum games. In Proceedings of the 18th international conference on machine learning (ICML’01) (pp. 322–328). San Francisco, CA: Morgan Kaufmann.

  26. Melo, F. S., & Ribeiro, M. I. (2007a). Rational and convergent model-free adaptive learning for team Markov games. Technical Report RT-601-07, Institute for Systems and Robotics, February.

  27. Melo, F. S., & Ribeiro, M. I. (2007b). Learning to coordinate in topological navigation tasks. In Proceedings of the 6th IFAC symposium on intelligent autonomous vehicles (IAV’07) (to appear), September.

  28. Melo, F. S., & Ribeiro, M. I. (2008). Emerging coordination in infinite team Markov games. In Proceedings of the 7th international conference on autonomous agents and multiagent systems (AAMAS’08) (pp. 355–362).

  29. Melo, F. S., & Veloso, M. (2009). Learning of coordination: Exploiting sparse interactions in multiagent systems. In Proceedings of the 8th international conference on autonomous agents and multiagent systems (AAMAS’09) (pp. 773–780).

  30. Melo, F. S., Meyn, S. P., & Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th international conference on machine learning (ICML’08) (pp. 664–671).

  31. Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. Communications and Control Engineering Series. New York: Springer.

  32. Nash J. F. (1950) Equilibrium points in n-person games. Proceedings of the National Academy of Sciences 36: 48–49

  33. Ormoneit D., Sen Ś. (2002) Kernel-based reinforcement learning. Machine Learning 49: 161–178

  34. Pelletier M. (1998) On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic Processes and their Applications 78: 217–244

  35. Perkins T. J., Precup D. (2003) A convergent form of approximate policy iteration. In: Thrun S., Becker S., Obermayer K. (eds) Advances in neural information processing systems. MIT Press, Cambridge, MA, pp 1595–1602

  36. Robinson J. (1951) An iterative method of solving a game. Annals of Mathematics 54: 296–301

  37. Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3), 210–229. Reprinted in IBM Journal of Research and Development, 44(1/2), 206–226, 2000.

  38. Samuel A. L. (1967) Some studies in machine learning using the game of checkers II: Recent progress. IBM Journal of Research and Development 11: 601–617

  39. Sen S., Weiß G. (1999) Learning in multiagent systems, chapter 6. MIT Press, Cambridge, MA, pp 259–298

  40. Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Reinforcement learning with soft state aggregation. In Advances in neural information processing systems (Vol. 7, pp. 361–368). Cambridge, MA: MIT Press.

  41. Singh, S. P., Kearns, M., & Mansour, Y. (2000). Nash convergence of gradient dynamics in general-sum games. In Proceedings of the 16th conference on uncertainty in artificial intelligence (UAI’00) (pp. 541–548).

  42. Sutton R. S., Barto A. G. (1998) Reinforcement learning: An introduction. Adaptive computation and machine learning series (3rd ed.). MIT Press, Cambridge, MA

  43. Szepesvári C. (1997) The asymptotic convergence rates for Q-learning. Proceedings of Neural Information Processing Systems (NIPS’97) 10: 1064–1070

  44. Szepesvári C., Littman M. L. (1999) A unified analysis of value-function-based reinforcement learning algorithms. Neural Computation 11(8): 2017–2059

  45. Szepesvári, C., & Smart, W. D. (2004). Interpolation-based Q-learning. In Proceedings of the 21st international conference on machine learning (ICML’04) (pp. 100–107). New York, USA: ACM Press, July.

  46. Tesauro G. (1994) TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2): 215–219

  47. Tesauro G. (1995) Temporal difference learning and TD-Gammon. Communications of the ACM 38(3): 58–68

  48. Tong H., Brown T. X. (2000) Reinforcement learning for call admission control and routing under quality of service constraints in multimedia networks. Machine Learning 49(2–3): 111–139

  49. Tsitsiklis J. N., Athans M. (1985) On the complexity of decentralized decision making and detection problems. IEEE Transactions on Automatic Control AC-30(5): 440–446

  50. Tsitsiklis J. N., Van Roy B. (1996) Feature-based methods for large scale dynamic programming. Machine Learning 22: 59–94

  51. Tsitsiklis J. N., Van Roy B. (1996) An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5): 674–690

  52. Uther, W., & Veloso, M. (2003). Adversarial reinforcement learning. Technical Report CMU-CS-03-107, School of Computer Science, Carnegie Mellon University, January.

  53. Wang X., Sandholm T. (2003) Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In: Becker S., Thrun S., Obermayer K. (eds) Advances in neural information processing systems. MIT Press, Cambridge, MA, pp 1571–1578

  54. Watkins, C. J. C. H. (1989). Learning from delayed rewards. PhD thesis, King’s College, University of Cambridge, May.

  55. Young H. P. (1993) The evolution of conventions. Econometrica 61(1): 57–84

Author information

Corresponding author

Correspondence to Francisco S. Melo.


About this article

Cite this article

Melo, F.S., Ribeiro, M.I. Coordinated learning in multiagent MDPs with infinite state-space. Auton Agent Multi-Agent Syst 21, 321–367 (2010). https://doi.org/10.1007/s10458-009-9104-y

