Abstract
Bayesian learning is an inference method designed to tackle the exploration-exploitation trade-off as a function of the uncertainty of a given probability model built from observations, within the Reinforcement Learning (RL) paradigm. It allows prior knowledge to be incorporated into the algorithms in the form of probability distributions. Finding the resulting Bayes-optimal policies is a notoriously hard problem. We focus our attention on RL for a special class of ergodic and controllable Markov games. We propose a new framework for computing near-optimal policies for each agent, under the assumptions that the Markov chains are regular and the inverse of the behavior strategy is well defined. A fundamental result of this paper is the development of a theoretical method that, based on the formulation of a non-linear problem, computes the near-optimal adaptive behavior strategies and policies of the game, under some restrictions, that maximize the expected reward. We prove that such behavior strategies and policies satisfy the Bayesian-Nash equilibrium. Another important result is that the RL process learns a model through the interaction of the agents with the environment, and we show how the proposed method can finitely approximate and estimate the elements of the transition matrices and utilities while maintaining an efficient long-term learning performance measure. We develop an algorithm for implementing this model. A numerical example shows how to deploy the estimation process as a function of the agents' experience.
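To make the experience-driven estimation step concrete, the following is a minimal sketch, not taken from the paper, of how an agent could maintain a Bayesian posterior over the entries of its transition matrix from observed transitions, with the estimate sharpening as experience accumulates. The Dirichlet-count scheme, the uniform prior, and all variable names are illustrative assumptions rather than the paper's actual construction.

```python
import numpy as np

# Illustrative sketch (assumed, not the paper's algorithm): Bayesian
# estimation of a transition matrix P(s' | s, a) from observed experience,
# using Dirichlet pseudo-counts over next states for each (state, action).

n_states, n_actions = 4, 2

# Uniform Dirichlet prior: pseudo-count of 1 for every (s, a, s') triple.
alpha = np.ones((n_states, n_actions, n_states))

def update(s, a, s_next):
    """Record one observed transition."""
    alpha[s, a, s_next] += 1.0

def estimated_transition_matrix():
    """Posterior-mean estimate of P(s' | s, a) given the experience so far."""
    return alpha / alpha.sum(axis=2, keepdims=True)

# Example: feed in simulated experience and watch the estimate improve.
rng = np.random.default_rng(0)
true_P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
s = 0
for _ in range(5000):
    a = rng.integers(n_actions)                    # exploratory action choice
    s_next = rng.choice(n_states, p=true_P[s, a])  # environment response
    update(s, a, s_next)
    s = s_next

# Maximum entrywise estimation error; it shrinks as experience grows.
print(np.max(np.abs(estimated_transition_matrix() - true_P)))
```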
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Clempner, J.B. A Bayesian reinforcement learning approach in markov games for computing near-optimal policies. Ann Math Artif Intell 91, 675–690 (2023). https://doi.org/10.1007/s10472-023-09860-3