Abstract
In this paper, a novel iterative Q-learning algorithm, called “policy iteration based deterministic Q-learning algorithm,” is developed to solve the optimal control problems for discrete-time deterministic nonlinear systems. The idea is to use an iterative adaptive dynamic programming (ADP) technique to construct the iterative control law which optimizes the iterative Q function. When the optimal Q function is obtained, the optimal control law can be achieved by directly minimizing the optimal Q function, where the mathematical model of the system is not necessary. Convergence property is analyzed to show that the iterative Q function is monotonically non-increasing and converges to the solution of the optimality equation. It is also proven that any of the iterative control laws is a stable control law. Neural networks are employed to implement the policy iteration based deterministic Q-learning algorithm, by approximating the iterative Q function and the iterative control law, respectively. Finally, two simulation examples are presented to illustrate the performance of the developed algorithm.
摘要
创新点
本文提出了一种新型策略迭代 Q 学习方法求解离散时间非线性系统最优控制问题. 本文主要思想是采用自适应动态规划(ADP)方法构建系统迭代控制控制律以及迭代 Q 函数, 使得系统性能指标达到最优. 当获得迭代 Q 函数后, 通过最小化迭代 Q 函数直接获得迭代控制而不需要系统的数学模型. 通过收敛性分析证明迭代 Q 函数单调递减收敛到最优. 论文首次分析策略迭代 Q 学习的稳定性, 证明了任意迭代控制均为容许控制. 最后通过仿真验证了方法的有效性.
Similar content being viewed by others
References
Mohler R R, Kolodziej W J. Optimal control of a class of nonlinear stochastic systems. IEEE Trans Automat Contr, 1981, 26: 1048–1054
Liu C, Atkeson C G, Su J. Neighboring optimal control for periodic tasks for systems with discontinuous dynamics. Sci China Inf Sci, 2011, 54: 653–663
Wang J, Wang T, Yao C, et al. Active tension optimal control for WT wheelchair robot by using a novel control law for holonomic or nonholonomic systems. Sci China Inf Sci, 2014, 57: 112203
Liu Z, Wang Y, Li H. Two kinds of optimal controls for probabilistic mix-valued logical dynamic networks. Sci China Inf Sci, 2014, 57: 052201
Li X, Wang H, Ding B, et al. MABP: an optimal resource allocation approach in data center networks. Sci China Inf Sci, 2014, 57: 102801
Yu H, Tang W, Li S. Joint optimal sensing time and power allocation for multi-channel cognitive radio networks considering sensing-channel selection. Sci China Inf Sci, 2014, 57: 042313
Werbos P J. Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 1977, 22: 25–38
Werbos P J. A Menu of Designs for Reinforcement Learning Over Time, in Neural Networks for Control. Massachusetts: MIT Press, 1991. 67–95
Modares H, Lewis F L. Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning. Automatica, 2014, 50: 1780–1792
Zhang H, Wei Q, Liu D. An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games. Automatica, 2011, 47: 207–214
Kumar M, Rajagopal K, Balakrishnan S N, et al. Reinforcement learning based controller synthesis for flexible aircraft wings. IEEE/CAA J Automat Sin, 2014, 1: 435–448
Kamalapurkar R, Klotz J R, Dixon W E. Concurrent learning-based approximate feedback-Nash equilibrium solution of N-player nonzero-sum differential games. IEEE/CAA J Automat Sin, 2014, 1: 239–247
Zhang Z, Zhao D. Clique-based cooperative multiagent reinforcement learning using factor graphs. IEEE/CAA J Automat Sin, 2015, 1: 248–256
Zhong X, He H, Zhang H, et al. Optimal control for unknown discrete-time nonlinear Markov jump systems using adaptive dynamic programming. IEEE Trans Neural Netw Learn Syst, 2014, 25: 2141–2155
Wei Q, Liu D, Yang X. Infinite horizon self-learning optimal control of nonaffine discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst, 2015, 26: 866–879
Prokhorov D V, Wunsch D C. Adaptive critic designs. IEEE Trans Neural Networks, 1997, 8: 997–1007
Wei Q, Liu D. Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification. IEEE Trans Autom Sci Eng, 2014, 11: 1020–1036
Song R, Xiao W, Sun C. A new self-learning optimal control laws for a class of discrete-time nonlinear systems based on ESN architecture. Sci China Inf Sci, 2014, 57: 068202
Wei Q, Wang F, Liu D, et al. Finite-approximation-error based discrete-time iterative adaptive dynamic programming. IEEE Trans Cybern, 2014, 44: 2820–2833
Ni Z, He B, Zhong X, Prokhorov D V. Model-free dual heuristic dynamic programming. IEEE Trans Neural Netw Learn Syst, 2015, 26: 1834–1839
Molina D, Venayagamoothy G K, Liang J, et al. Intelligent local area signals based damping of power system oscillations using virtual generators and approximate dynamic programming. IEEE Trans Smart Grid, 2013, 4: 498–508
Bertsekas D P, Tsitsiklis J N. Neuro-Dynamic Programming. Belmont: Athena Scientific, 1996
Si J, Wang Y T. On-line learning control by association and reinforcement. IEEE Trans Neural Networks, 2001, 12: 264–276
Wei Q, Liu D. Data-driven neuro-optimal temperature control of water gas shift reaction using stable iterative adaptive dynamic programming. IEEE Trans Ind Electron, 2014, 61: 6399–6408
Dierks T, Jagannathan S. Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Trans Neural Networks, 2012, 23: 1118–1129
Dierks T, Thumati B, Jagannathan S. Optimal control of unknown affine nonlinear discrete-time systems using offlinetrained neural networks with proof of convergence. Neural Networks, 2009, 22: 851–860
Wei Q, Liu D. An iterative ε-optimal control scheme for a class of discrete-time nonlinear systems with unfixed initial state. Neural Networks, 2012, 32: 236–244
Wei Q, Song R, Yan P. Data-driven zero-sum neuro-optimal control for a class of continuous-time unknown nonlinear systems with disturbance using ADP. IEEE Trans Neural Netw Learn Syst, 2015, PP: 1
Lewis F L, Vrabie D, Vamvoudakis K G. Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Contr Syst, 2012, 32: 76–105
Modares H, Lewis F L, Naghibi-Sistani M B. Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw Learn Syst, 2013, 24: 1513–1525
Wei Q, Liu D, Y Xu. Policy iteration optimal tracking control for chaotic systems by adaptive dynamic programming approach. Chin Phys B, 2015, 24: 030502
Modares H, Lewis F L, Naghibi-Sistani M B. Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica, 2014, 50: 193–202
Murray J J, Cox C J, Lendaris G G, et al. Adaptive dynamic programming. IEEE Trans Syst Man Cybern Part C-Appl Rev, 2002, 32: 140–153
Vamvoudakis K G, Lewis F L. Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton-Jacobi equations. Automatica, 2011, 47: 1556–1569
Liu D, Wei Q. Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans Neural Netw Learn Syst, 2014, 25: 621–634
Song R, Xiao W, Zhang H, et al. Adaptive dynamic programming for a class of complex-valued nonlinear systems. IEEE Trans Neural Netw Learn Syst, 2014, 25: 1733–1739
Song R, Lewis F L, Wei Q, et al. Multiple Actor-critic structures for continuous-time optimal control using input-output data. IEEE Trans Neural Netw Learn Syst, 2015, 26: 851–865
Song R, Lewis F L, Wei Q, et al. Off-policy actor-critic structure for optimal control of unknown systems with disturbances. IEEE Trans Cybern, 2015, PP: 1
Al-Tamimi A, Lewis F L, Abu-Khalaf M. Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Trans Syst Man Cybern Part B-Cybern, 2008, 38: 943–949
Lincoln B, Rantzer A. Relaxing dynamic programming. IEEE Trans Automat Contr, 2006, 51: 1249–1260
Wei Q, Wang D, Zhang D. Dual iterative adaptive dynamic programming for a class of discrete-time nonlinear systems with time-delays. Neural Comput Appl, 2013, 23: 1851–1863
Zhang H, Wei Q, Luo Y. A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans Syst Man Cybern Part B-Cybern, 2008, 38: 937–942
Wei Q, Liu D. Neural-network-based adaptive optimal tracking control scheme for discrete-time nonlinear systems with approximation errors. Neurocomputing, 2015, 149: 106–115
Wei Q, Liu D. Stable iterative adaptive dynamic programming algorithm with approximation errors for discrete-time nonlinear systems. Neural Comput Appl, 2014, 24: 1355–1367
Wei Q, Liu D. Numerically adaptive learning control scheme for discrete-time nonlinear systems. IET Control Theory Appl, 2013, 7: 1472–1486
Kiumarsi B, Lewis F L, Modares H, et al. Reinforcement image-learning for optimal tracking control of linear discretetime systems with unknown dynamics. Automatica, 2014, 50: 1167–1175
Liu D, Wei Q. Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems. IEEE Trans Cybern, 2013, 43: 779–789
Wei Q, Liu D. A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems. IEEE Trans Autom Sci Eng, 2014, 11: 1176–1190
Wei Q, Liu D, Shi G, et al. Optimal multi-battery coordination control for home energy management systems via distributed iterative adaptive dynamic programming. IEEE Trans Ind Electron, 2015, 42: 4203–4214
Wei Q, Liu D. Nonlinear neuro-optimal tracking control via stable iterative Q-learning algorithm. Neurocomputing, 2015, 168: 520–528
Watkins C. Learning from delayed rewards. Dissertation for the Doctoral Degree. Cambridge: Cambridge University, 1989
Watkins C, Danyan P. Q-learning. Mach Learn, 1992, 8: 279–292
Busoniu L, Babuska R, Schutter B D, et al. Reinforcement Learning and Dynamic Programming Using Function Approximators. Boca Raton: CRC Press, 2010
Wei Q, Liu D, Shi G. A novel dual iterative Q-learning method for optimal battery management in smart residential environments. IEEE Trans Ind Electron, 2015, 62: 2509–2518
Huang T, Liu D. A self-learning scheme for residential energy system control and management. Neural Comput Appl, 2013, 22: 259–269
Boaro M, Fuselli D, Aagelis F D, et al. Adaptive dynamic programming algorithm for renewable energy scheduling and battery management. Cognitive Comput, 2013, 5: 264–277
Fuselli D, Angelis F D, Boaro M, et al. Action dependent heuristic dynamic programming for home energy resource scheduling. Int J Elec Power Energ Syst, 2013, 48: 148–160
Prashanth L A, Bhatnagar S. Reinforcement learning with function approximation for traffic signal control. IEEE Trans Intell Transp Syst, 2011, 12: 412–421
Dorf R C, Bishop R H. Modern Control Systems. 12th ed. New York: Prentice Hall, 2011
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Wei, Q., Liu, D. A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems. Sci. China Inf. Sci. 58, 1–15 (2015). https://doi.org/10.1007/s11432-015-5462-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11432-015-5462-z
Keywords
- adaptive critic designs
- adaptive dynamic programming
- approximate dynamic programming
- Q-learning
- policy iteration
- neural networks
- nonlinear systems
- optimal control