Abstract
A model-based offline policy iteration (PI) algorithm and a model-free online Q-learning algorithm are proposed for solving fully cooperative linear quadratic dynamic games. The PI-based adaptive Q-learning method learns the feedback Nash equilibrium online from state samples generated by behavior policies, without querying the system model. Unlike existing Q-learning methods, the proposed algorithm performs both policy evaluation and policy improvement adaptively. We prove the convergence of the offline PI algorithm by showing that it is equivalent to Newton's method for solving the game algebraic Riccati equation (GARE). Furthermore, we prove that the proposed Q-learning method converges to the Nash equilibrium for a sufficiently small learning rate, provided that certain persistence of excitation conditions hold; these conditions can be easily met by suitable behavior policies. Simulation results demonstrate the good performance of the proposed online adaptive Q-learning algorithm.
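As an illustration of the model-based PI idea, the following is a minimal sketch and not the authors' algorithm. It assumes a discrete-time, two-player setting x_{k+1} = A x_k + B1 u1 + B2 u2 in which both players minimize one shared quadratic cost, so the game reduces to a single Riccati equation with stacked input matrix B = [B1 B2] and block-diagonal R. The matrices, the zero initial gain, and the function names `policy_iteration` and `riccati_residual` are hypothetical choices made for the example. Each iteration evaluates the current feedback gain by solving a Lyapunov equation and then improves the gain, which is the classical PI scheme for Riccati equations.

```python
import numpy as np

def policy_iteration(A, B, Q, R, K0, n_iter=20):
    """Model-based PI for a discrete-time LQ problem with a shared cost.

    K0 must be stabilizing: the spectral radius of A - B K0 must be < 1.
    """
    n = A.shape[0]
    K = K0
    for _ in range(n_iter):
        A_cl = A - B @ K                 # closed loop under the current policy
        Q_cl = Q + K.T @ R @ K           # stage cost under the current policy
        # Policy evaluation: solve P = A_cl^T P A_cl + Q_cl by vectorization.
        M = np.eye(n * n) - np.kron(A_cl.T, A_cl.T)
        P = np.linalg.solve(M, Q_cl.reshape(-1)).reshape(n, n)
        P = 0.5 * (P + P.T)              # remove numerical asymmetry
        # Policy improvement: K = (R + B^T P B)^{-1} B^T P A.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K

def riccati_residual(A, B, Q, R, P):
    """Largest absolute entry of the discrete-time Riccati residual at P."""
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return np.max(np.abs(A.T @ P @ A - P + Q - A.T @ P @ B @ G))

# Hypothetical example data (assumed, not taken from the paper).
A = np.array([[0.9, 0.2], [0.0, 0.8]])   # open-loop stable, so K0 = 0 is admissible
B1 = np.array([[0.0], [1.0]])            # input matrix of player 1
B2 = np.array([[1.0], [0.0]])            # input matrix of player 2
B = np.hstack([B1, B2])                  # stacked inputs of the cooperating players
Q = np.eye(2)                            # shared state weight
R = np.diag([1.0, 2.0])                  # block-diagonal control weights
K0 = np.zeros((2, 2))

P, K = policy_iteration(A, B, Q, R, K0)
print("Riccati residual:", riccati_residual(A, B, Q, R, P))
print("Player 1 gain:", K[0], "Player 2 gain:", K[1])
```

Row i of the stacked gain K is the feedback gain of player i + 1. Because the sketch uses A and B directly, it corresponds to the offline, model-based setting only; the online Q-learning method described in the abstract replaces these model-based solves with updates driven by sampled data.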
Acknowledgements
This work was supported by Key Program of National Natural Science Foundation of China (Grant No. U1613225).
Cite this article
Li, X., Peng, Z., Jiao, L. et al. Online adaptive Q-learning method for fully cooperative linear quadratic dynamic games. Sci. China Inf. Sci. 62, 222201 (2019). https://doi.org/10.1007/s11432-018-9865-9
DOI: https://doi.org/10.1007/s11432-018-9865-9