Neurocomputing

Volume 285, 12 April 2018, Pages 51-59
Value iteration based integral reinforcement learning approach for H∞ controller design of continuous-time nonlinear systems

https://doi.org/10.1016/j.neucom.2018.01.029

Abstract

In this paper, a novel integral reinforcement learning approach is developed based on value iteration (VI) for designing the H∞ controller of continuous-time (CT) nonlinear systems. First, the VI learning mechanism is introduced to solve the zero-sum game problem, which is equivalent to solving the Hamilton–Jacobi–Isaacs (HJI) equation arising in H∞ control problems. Since the proposed method is based on the VI learning mechanism, it does not require an initial admissible control for its implementation, and thus admits a more general initial condition than works based on policy iteration (PI). The iterative property of the value function is analysed with an arbitrary initial positive function, and the H∞ controller can be derived as the iteration converges. For the implementation of the proposed method, three neural networks are introduced to approximate the iterative value function, the iterative control policy and the iterative disturbance policy, respectively. To verify the effectiveness of the VI based method, a linear case and a nonlinear case are presented.

Introduction

In various industrial applications, disturbances arise in many situations and negatively affect the controlled systems. To handle this control problem, H∞ control has been widely investigated and has become an essential part of robust control. The goal of H∞ control is to find a feedback controller for a given system while accounting for both robustness and control performance. In the early years, the H∞ control problem was studied for linear systems [1], [2]. Later, researchers [3], [4], [5], [6], [7], [8] developed the H∞ control theory for nonlinear systems. The work of [6] indicated that the H∞ control problem is equivalent to a two-player zero-sum differential game. The Nash equilibrium solution of the game can be obtained by solving the Hamilton–Jacobi–Isaacs (HJI) equation, a nonlinear partial differential equation (PDE). For the linear case, the HJI equation reduces to a Riccati equation which can be solved efficiently. However, for the nonlinear case, there is still no approach to solve the HJI equation analytically. This has inspired researchers to study approaches for solving the HJI equation approximately, and some direct approaches were proposed in the early period [4], [9]. Unfortunately, these direct approaches were limited by their computational load. In recent years, some researchers developed an indirect approach to approximate the solution of the HJI equation by introducing reinforcement learning (RL) techniques.

Over the last several decades, RL, which attempts to imitate the way mammals learn, has been widely studied [10], [11], [12], [13]. The concept of RL is learning how to map situations to actions so as to maximize a numerical reward signal [12]. Unlike most forms of machine learning, the learner is not told which actions to take, but must instead discover which actions yield the most reward by trying them. Moreover, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. Because of this important distinguishing feature, some researchers [14], [15], [16] introduced the idea of RL into solving the optimal control problem arising in nonlinear control, and proposed an actor-critic structure to approximately solve the nonlinear PDE known as the Hamilton–Jacobi–Bellman (HJB) equation. This RL-based technique is known as approximate dynamic programming, or adaptive dynamic programming (ADP). Since the HJI equation is also a nonlinear PDE, much attention has been devoted to introducing this RL-based technique to seek the solution of the HJI equation [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32]. Generally, there are two typical ways in the ADP framework to solve such PDEs: policy iteration (PI) and value iteration (VI) [14].

For the H∞ control problem arising in continuous-time (CT) nonlinear systems, various PI based methods have been studied. One feature of PI is that it requires solving for a value function associated with an admissible control policy in the policy evaluation step [33], [34]. In [17], [20], the authors proved that the HJI equation can be solved by using PI, and iterative convergence to the available storage function associated with a given L2-gain was established. In [18], [19], the finite-horizon H∞ control problem was studied using PI. In [21], [26], PI based methods were developed that can deal with systems with unknown drift dynamics and can be implemented in an on-line manner. In [22], a PI based method was proposed to show that the mixed optimum of the zero-sum game can be derived even if the saddle point solution does not exist. In [23], [24], the authors designed PI based algorithms to seek the solution of the HJI equation using only one neural network. In [25], the authors developed a PI based integral reinforcement learning algorithm [34] for the H∞ control of unknown CT linear systems. In [28], a novel PI based technique called off-policy learning was introduced to solve the HJI equation, in which the system data used to tune the algorithm can be generated by arbitrary policies rather than the policy being evaluated. The authors of [29], [31] developed the off-policy technique to design the H∞ controller for unknown CT nonlinear systems. Although various well developed methods have been proposed for the H∞ controller design of CT nonlinear systems, all of them are based on PI and thus assume an initial admissible control [33]. From a mathematical point of view, an admissible control can be regarded as a suboptimal control whose construction generally requires solving nonlinear partial differential equations analytically. Thus, ensuring admissibility may in practice be a seriously restrictive condition. To the best of our knowledge, there is still no general approach to obtain such a control, especially for nonlinear systems subject to disturbances.

On the other hand, the learning mechanism of VI allows more freedom in the initial condition than PI, since the admissible control assumption is not required [35], [36], [37], [38], [39], [40], [41]. In [35], the convergence of the VI method was proved with an initial zero value function for the optimal control of discrete-time (DT) nonlinear systems. In [39], the authors discussed the convergence of VI in a more general way for the optimal control problem of DT nonlinear systems, where the algorithm can be initialized with an arbitrary positive value function. Owing to this benefit in the initial condition, some researchers introduced VI to solve the H∞ control problem arising in DT systems [42], [43]. In [42], the authors introduced the VI learning mechanism into the Q-learning method for the H∞ control problem of DT linear systems. In [43], the authors developed a VI based algorithm to seek the solution of the zero-sum game for DT nonlinear systems, which is equivalent to the solution of the HJI equation associated with H∞ control problems. However, the above works were proposed for DT systems. Discussions on solving the H∞ control problem by the VI method for CT nonlinear systems are scarce, which motivates our research.

In this paper, a novel VI based integral reinforcement learning method is proposed to design the H∞ controller for CT nonlinear systems. First, the algorithm is derived by introducing the VI learning mechanism into integral reinforcement learning to solve the HJI equation arising in H∞ control problems for CT nonlinear systems. Since the proposed method is based on the VI learning mechanism, it admits a more general initial condition than PI based works, which require an initial admissible control for implementation. The iterative property of the value function is analysed with an arbitrary initial positive function, and the H∞ controller can be derived as the iteration converges. For the implementation of the proposed method, three neural networks are introduced to approximate the iterative value function, the iterative control policy and the iterative disturbance policy, respectively. Finally, two simulation cases are presented to illustrate the effectiveness of the proposed method.

Section snippets

Problem statement

Consider the CT nonlinear system described as
$$\dot{x} = f(x) + g(x)\,u + p(x)\,d, \qquad y = z(x),$$
where $x \in \mathbb{R}^n$ is the system state vector, $u \in \mathbb{R}^m$ is the control input, $d \in \mathbb{R}^p$ is the external disturbance and $y \in \mathbb{R}^q$ is the output. The system dynamics $f(x) \in \mathbb{R}^n$, $g(x) \in \mathbb{R}^{n \times m}$ and $p(x) \in \mathbb{R}^{n \times p}$ are Lipschitz continuous on a set $\Omega \subset \mathbb{R}^n$ and satisfy $f(0) = 0$. The output map satisfies zero-state observability.
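As an illustration only, a minimal Python sketch of such a system is given below. The particular dynamics f, g, p and output map z are hypothetical placeholders chosen to satisfy the stated assumptions (Lipschitz continuity and f(0) = 0); they are not the dynamics used in the paper.

```python
import numpy as np

# A minimal, hypothetical 2-dimensional instance of system (1):
#   x_dot = f(x) + g(x) u + p(x) d,   y = z(x),   f(0) = 0.
# The particular f, g, p and z below are illustrative placeholders only.

def f(x):
    """Drift dynamics f(x): R^n -> R^n, Lipschitz on a compact set, f(0) = 0."""
    x1, x2 = x
    return np.array([-x1 + x2, -0.5 * (x1 + x2) + 0.5 * x2 * np.sin(x1) ** 2])

def g(x):
    """Control input gain g(x): R^n -> R^(n x m), here with m = 1."""
    return np.array([[0.0], [np.sin(x[0])]])

def p(x):
    """Disturbance gain p(x): R^n -> R^(n x p), here with p = 1."""
    return np.array([[0.0], [1.0]])

def z(x):
    """Output map y = z(x); zero-state observability is assumed."""
    return np.asarray(x, dtype=float)
```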

The control objective of H∞ controller design is to seek a control policy u(x) that ensures the asymptotic stability of

Main results

First, for convenience, the following definition is proposed.

Definition 1

Define $x_{u,d}(t+T)$ as the state of system (1) integrated by time length $T$ from $x(t)$ with control policy $u(x)$ and disturbance policy $d(x)$, i.e.,
$$x_{u,d}(t+T) = x(t) + \int_t^{t+T} \big[ f(x(\tau)) + g(x(\tau))\,u(x(\tau)) + p(x(\tau))\,d(x(\tau)) \big]\, d\tau.$$
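For illustration, the state $x_{u,d}(t+T)$ of Definition 1 can be computed numerically. The sketch below, assuming the dynamics and policies are available as callables (e.g., the placeholders above), simply integrates the closed-loop vector field over an interval of length T.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rollout(x_t, T, f, g, p, u, d):
    """Return x_{u,d}(t+T) as in Definition 1: the state obtained by
    integrating x_dot = f(x) + g(x)u(x) + p(x)d(x) for a time length T,
    starting from x(t) = x_t."""
    def closed_loop(_, x):
        return f(x) + (g(x) @ u(x)).ravel() + (p(x) @ d(x)).ravel()
    sol = solve_ivp(closed_loop, (0.0, T), np.asarray(x_t, float),
                    rtol=1e-8, atol=1e-10)
    return sol.y[:, -1]

# Example with the hypothetical dynamics above and zero policies:
# x_next = rollout(np.array([0.5, -0.5]), 0.05, f, g, p,
#                  u=lambda x: np.zeros(1), d=lambda x: np.zeros(1))
```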

Based on Definition 1 and inspired by the work of [34], the VI based integral reinforcement learning algorithm for H∞ control of CT nonlinear systems can be described as:

Value function update:
$$V_{i+1}(x) = \int_t^{t+T} U(x, u_i, d_i)\, d\tau + V_i\big(x_{u_i,d_i}(t+T)\big).$$
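A minimal sketch of this value function update for a single state sample follows. The utility U is assumed here to take the standard zero-sum form $z(x)^\top z(x) + u^\top R u - \gamma^2 d^\top d$, which is not reproduced in the snippets above, and the iterative quantities $u_i$, $d_i$ and $V_i$ are supplied as callables (in practice, the neural-network approximators of the next section).

```python
import numpy as np
from scipy.integrate import solve_ivp

def vi_target(x_t, T, f, g, p, u_i, d_i, V_i, z, R, gamma):
    """Value-iteration target for one state sample x(t):
        V_{i+1}(x(t)) = int_t^{t+T} U(x, u_i, d_i) dtau + V_i(x_{u_i,d_i}(t+T)),
    where U = z(x)'z(x) + u'Ru - gamma^2 d'd is an assumed standard zero-sum
    utility (not taken from the paper's snippets)."""
    def augmented(_, s):
        x = s[:-1]
        u, d = u_i(x), d_i(x)
        xdot = f(x) + (g(x) @ u).ravel() + (p(x) @ d).ravel()
        utility = z(x) @ z(x) + u @ R @ u - gamma ** 2 * (d @ d)
        return np.append(xdot, utility)
    s0 = np.append(np.asarray(x_t, float), 0.0)  # augment state with running cost
    sol = solve_ivp(augmented, (0.0, T), s0, rtol=1e-8, atol=1e-10)
    x_next, integral_cost = sol.y[:-1, -1], sol.y[-1, -1]
    return integral_cost + V_i(x_next)
```

In an implementation, such targets would be computed at a batch of sampled states and then used to fit the next critic approximation, after which the control and disturbance policies are updated from the new value function.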

Implementation of the proposed method

For the implementation of ADP based methods, approximation tools, such as neural networks or fuzzy basis functions, are required to approximate the solutions in (14)–(16). In this paper, three three-layer back propagation neural networks, named the critic neural network, the actor neural network and the disturbance neural network, are introduced to approximate $V_i(x)$, $u_i(x)$ and $d_i(x)$, respectively.

The three-layer back propagation neural network can be described as Θ(x)
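As a rough illustration of such an approximator, the sketch below implements a generic three-layer (input-hidden-output) network trained by back propagation on a squared approximation error. The tanh activation and the plain gradient-descent update are assumptions made for the sketch, not details taken from the paper.

```python
import numpy as np

class ThreeLayerNN:
    """Minimal three-layer back propagation network used as a sketch of the
    critic/actor/disturbance approximators (architecture details assumed)."""

    def __init__(self, n_in, n_hidden, n_out, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((n_hidden, n_in))
        self.W2 = 0.1 * rng.standard_normal((n_out, n_hidden))
        self.lr = lr

    def forward(self, x):
        h = np.tanh(self.W1 @ x)
        return self.W2 @ h

    def train_step(self, x, target):
        """One back-propagation step on 0.5 * ||output - target||^2."""
        h = np.tanh(self.W1 @ x)
        y = self.W2 @ h
        e = y - target                              # output error
        self.W2 -= self.lr * np.outer(e, h)         # output-layer gradient
        grad_h = (self.W2.T @ e) * (1.0 - h ** 2)   # backpropagated error
        self.W1 -= self.lr * np.outer(grad_h, x)    # hidden-layer gradient
        return 0.5 * float(e @ e)
```

A critic instance would map the state to a scalar value (n_out = 1), while the actor and disturbance networks would map the state to the control and disturbance dimensions, respectively.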

Simulation

In this section, two simulation cases are carried out to show the effectiveness of the VI based integral reinforcement learning method. The first is a linear case, for which the optimal solution can be obtained from the algebraic Riccati equation (ARE) and compared with the obtained results. The second is a nonlinear case that validates the theoretical results for nonlinear systems.
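For the linear case, a natural check (sketched below under an assumed standard quadratic setting, since the paper's system matrices and weights are not reproduced in the snippets) is to evaluate the residual of the zero-sum game ARE at the matrix P recovered from the converged critic; a small residual indicates agreement with the ARE solution.

```python
import numpy as np

def game_are_residual(P, A, B, D, C, R, gamma):
    """Residual norm of the zero-sum game ARE associated with linear
    H-infinity control (standard form, assumed here):
        A'P + PA + C'C - P (B R^{-1} B' - gamma^{-2} D D') P = 0.
    A small value indicates the learned P matches the ARE solution."""
    res = (A.T @ P + P @ A + C.T @ C
           - P @ (B @ np.linalg.solve(R, B.T) - D @ D.T / gamma ** 2) @ P)
    return np.linalg.norm(res)
```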

Conclusions

In this paper, a novel integral reinforcement learning approach was developed based on VI for designing the H∞ controller of CT nonlinear systems. The proposed algorithm does not require an initial admissible control for its implementation and thus admits a more general initial condition than PI based works. The iterative property of the value function was analysed with an arbitrary initial positive function, and the H∞ controller could be derived as the iteration converges. For the

Acknowledgment

This work was supported by the National Natural Science Foundation of China (61433004) and the IAPI Fundamental Research Funds (2013ZCX14). It was also supported by the Development Project of Key Laboratory of Liaoning Province.

References (47)

  • J.C. Doyle et al.

    State-space solutions to standard H2 and H∞ control problems

    IEEE Trans. Autom. Control

    (1989)
  • A.J. van der Schaft

    L2-gain analysis of nonlinear systems and nonlinear state-feedback H∞ control

    IEEE Trans. Autom. Control

    (1992)
  • J. Huang et al.

    Numerical approach to computing nonlinear H-infinity control laws

    J. Guid. Control Dyn.

    (1995)
  • A. Isidori et al.

    H∞ control via measurement feedback for general nonlinear systems

    IEEE Trans. Autom. Control

    (1995)
  • T. Başar et al.

    H∞ Optimal Control and Related Minimax Design Problems

    (1995)
  • R.W. Beard et al.

    Successive Galerkin approximation algorithms for nonlinear optimal and robust control

    Int. J. Control

    (1998)
  • T. Başar et al.

    Dynamic Noncooperative Game Theory

    (1999)
  • M.G. Crandall et al.

    User's guide to viscosity solutions of second order partial differential equations

    Bull. Am. Math. Soc.

    (1992)
  • P.J. Werbos, A Menu of Designs for Reinforcement Learning Over Time, MIT Press, pp....
  • D.P. Bertsekas

    Neuro-Dynamic Programming

    (1996)
  • R.S. Sutton et al.

    Reinforcement Learning: An Introduction

    (1998)
  • D. Silver et al.

    Mastering the game of go with deep neural networks and tree search

    Nature

    (2016)
  • P.J. Werbos, Approximate Dynamic Programming for Real-Time Control and Neural Modeling, Van Nostrand Reinhold, vol. ...

    Geyang Xiao received the B.S. degree in Automation Control from Northeastern University, Shenyang, China, in 2012. He has been pursuing the Ph.D. degree with Northeastern University, Shenyang, China, since 2012. His current research interests include reinforcement learning, neural network-based control, nonlinear optimal control, adaptive dynamic programming, and their industrial applications.

    Huaguang Zhang received the B.S. degree and the M.S. degree in control engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively. He received the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, School of Information Science and Engineering, Northeastern University, Shenyang, China. His main research interests are fuzzy control, stochastic system control, neural networks based control, nonlinear control, and their applications. He has authored or coauthored over 200 journal and conference papers and four monographs, and has co-invented 20 patents.

    Kun Zhang received the B.S. degree in mathematics and applied mathematics from Hebei Normal University, Shijiazhuang, China, in 2012 and the M.S. degree in management science and engineering from Northwest University for Nationalities, Lanzhou, China, in 2015. He is currently pursuing the Ph.D. degree in control theory and control engineering at Northeastern University, Shenyang, China. His main research interests include reinforcement learning, dynamic programming, neural networks-based controls and their industrial applications.

    Yinlei Wen received the B.S. degree in automation control in 2012 and the M.S. degree in control engineering in 2014 from Northeastern University, Shenyang, China. He has been pursuing the Ph.D. degree since 2015 in Northeastern University, Shenyang, China. His current research covers neural adaptive dynamic programming, neural networks, non-linear controls and their industrial applications.
