
A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems

  • Research Paper
  • Published in Science China Information Sciences

Abstract

In this paper, a novel iterative Q-learning algorithm, called the "policy iteration based deterministic Q-learning algorithm," is developed to solve optimal control problems for discrete-time deterministic nonlinear systems. The idea is to use an iterative adaptive dynamic programming (ADP) technique to construct the iterative control law that optimizes the iterative Q function. Once the optimal Q function is obtained, the optimal control law can be achieved by directly minimizing the optimal Q function, so that a mathematical model of the system is not required at that step. The convergence property is analyzed to show that the iterative Q function is monotonically non-increasing and converges to the solution of the optimality equation. It is also proven that every iterative control law is a stable control law. Neural networks are employed to implement the policy iteration based deterministic Q-learning algorithm by approximating the iterative Q function and the iterative control law, respectively. Finally, two simulation examples are presented to illustrate the performance of the developed algorithm.
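
To make the two alternating steps concrete, the following minimal sketch runs the policy evaluation of an iterative Q function and the policy improvement by minimization over the control on a discretized state-action grid. The scalar dynamics f, the utility U, the grids, and the iteration limits are illustrative assumptions for this sketch, not the paper's simulation examples.

import numpy as np

# Minimal sketch of policy iteration based deterministic Q-learning on a
# discretized state-action grid. The dynamics f, utility U, grids, and
# iteration limits are illustrative assumptions, not the paper's examples.

xs = np.linspace(-1.0, 1.0, 41)            # discretized state grid
us = np.linspace(-1.0, 1.0, 41)            # discretized control grid

def f(x, u):                               # example deterministic nonlinear system
    return 0.5 * x * np.cos(x) + u

def U(x, u):                               # utility function U(x, u) = x^2 + u^2
    return x * x + u * u

def nearest(grid, value):                  # project a value back onto the grid
    return int(np.argmin(np.abs(grid - value)))

Xg, Ug = np.meshgrid(xs, us, indexing="ij")
cost = U(Xg, Ug)                           # stage cost on the state-action grid
nxt = np.array([[nearest(xs, f(x, u)) for u in us] for x in xs])

v = np.full(len(xs), nearest(us, 0.0))     # initial admissible law: v_0(x) = 0
Q = np.zeros((len(xs), len(us)))

for it in range(50):                       # policy-iteration loop
    # policy evaluation: solve Q(x, u) = U(x, u) + Q(f(x, u), v(f(x, u)))
    for _ in range(500):
        Q_new = cost + Q[nxt, v[nxt]]
        converged = np.max(np.abs(Q_new - Q)) < 1e-9
        Q = Q_new
        if converged:
            break
    # policy improvement: v_{i+1}(x) = argmin_u Q_i(x, u); no model is used here
    v_new = np.argmin(Q, axis=1)
    if np.array_equal(v_new, v):           # the control law has converged
        break
    v = v_new

print("near-optimal control at x = 0.5:", us[v[nearest(xs, 0.5)]])

Because the system is deterministic and the stage cost is zero only at the origin, each policy evaluation converges in finitely many sweeps on this grid, and the improvement step only ever consults the stored Q values.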

Abstract (Chinese)

Key contributions

This paper proposes a novel policy iteration Q-learning method for solving optimal control problems of discrete-time nonlinear systems. The main idea is to use the adaptive dynamic programming (ADP) technique to construct the iterative control law and the iterative Q function so that the system performance index is optimized. Once the iterative Q function is obtained, the iterative control law is obtained by directly minimizing the iterative Q function, without requiring a mathematical model of the system. The convergence analysis proves that the iterative Q function is monotonically non-increasing and converges to the optimum. The stability of policy iteration Q-learning is analyzed for the first time, proving that every iterative control law is an admissible control. Finally, simulations verify the effectiveness of the method.
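
The neural-network implementation described above can be emulated with any parametric approximator. The sketch below stands in for the paper's critic and action structure with a least-squares fit of the iterative Q function on quadratic features; the iterative control law is then read off by minimizing the fitted Q function over a control grid, so the minimization itself uses no system model, although the sketch still uses f to generate the sampled transitions. The features, sampled data, dynamics, and utility are illustrative assumptions, not the paper's network structures or simulation examples.

import numpy as np

# Least-squares stand-in for the paper's neural-network critic and action
# structure: the iterative Q function is fitted on quadratic features, and the
# iterative control law is obtained by minimizing the fitted Q over a control
# grid. The features, dynamics, utility, and sampling are illustrative.

rng = np.random.default_rng(0)

def f(x, u):                                # same illustrative dynamics as above
    return 0.5 * x * np.cos(x) + u

def U(x, u):                                # illustrative utility function
    return x * x + u * u

def phi(x, u):                              # quadratic feature vector for Q(x, u)
    return np.array([x * x, x * u, u * u])

u_grid = np.linspace(-1.0, 1.0, 201)

def control(w, x):                          # v(x) = argmin_u  w' phi(x, u)
    return u_grid[np.argmin([phi(x, u) @ w for u in u_grid])]

X = rng.uniform(-1.0, 1.0, size=300)        # sampled states
Uc = rng.uniform(-1.0, 1.0, size=300)       # sampled controls
A = np.array([phi(x, u) for x, u in zip(X, Uc)])

w = np.array([1.0, 0.0, 1.0])               # Q_0 = x^2 + u^2, so v_0(x) = 0 is admissible
for it in range(20):                        # outer loop: policy improvement
    w_policy = w.copy()                     # freeze the control law v_i(x) = argmin_u Q_i(x, u)
    xn = f(X, Uc)                           # next states for the sampled pairs
    un = np.array([control(w_policy, x) for x in xn])
    Phin = np.array([phi(x, u) for x, u in zip(xn, un)])
    for _ in range(200):                    # inner loop: fit Q(x, u) = U(x, u) + Q(x', v_i(x'))
        targets = U(X, Uc) + Phin @ w
        w_new, *_ = np.linalg.lstsq(A, targets, rcond=None)
        done = np.max(np.abs(w_new - w)) < 1e-9
        w = w_new
        if done:
            break
    if np.max(np.abs(w - w_policy)) < 1e-6:  # Q function (and hence the law) has converged
        break

print("approximate optimal control at x = 0.5:", control(w, 0.5))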



Author information


Corresponding author

Correspondence to DeRong Liu.


About this article


Cite this article

Wei, Q., Liu, D. A novel policy iteration based deterministic Q-learning for discrete-time nonlinear systems. Sci. China Inf. Sci. 58, 1–15 (2015). https://doi.org/10.1007/s11432-015-5462-z

