
Neurocomputing

Volume 165, 1 October 2015, Pages 163-170

Online optimal control of unknown discrete-time nonlinear systems by using time-based adaptive dynamic programming

https://doi.org/10.1016/j.neucom.2015.03.006

Abstract

In this paper, an online optimal control scheme for a class of unknown discrete-time (DT) nonlinear systems is developed. The proposed algorithm uses current and recorded data to obtain the optimal controller without knowledge of the system dynamics. In order to carry out the algorithm, a neural network (NN) is constructed to identify the unknown system. Then, based on the estimated system model, a novel time-based ADP algorithm that does not use the system dynamics is implemented on an actor–critic structure. Two NNs are used in the structure to generate the optimal cost and the optimal control policy; both are updated once per sampling instant, so the algorithm can be regarded as time-based. The persistence of excitation condition, which is generally required in adaptive control, is ensured by a new criterion that uses current and recorded data in the update of the critic neural network. Lyapunov techniques are used to show that the system states, cost function and control signals are all uniformly ultimately bounded (UUB) with small bounded errors while explicitly considering the approximation errors caused by the three NNs. Finally, simulation results are provided to verify the effectiveness of the proposed approach.

Introduction

The theory of optimal control is concerned with finding a control law for a given system and a user-defined optimality criterion. Traditional optimal control design methods are generally offline and require complete knowledge of the system dynamics [1]. Adaptive control techniques, on the other hand, are designed for the online control of uncertain systems. However, classical adaptive control methods are generally far from optimal.

During the last few decades, reinforcement learning (RL) [2], [3], [4] has successfully provided a way to bring together the advantages of adaptive and optimal control. A class of RL-based adaptive optimal controllers, called approximate/adaptive dynamic programming (ADP), was first developed by Werbos [5], [6]. Extensions of RL-based controllers to DT systems have been considered by many researchers [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. In [7], the authors attempted to solve the DT nonlinear optimal control problem offline using ADP approaches and neural networks, assuming that there are no NN reconstruction errors. Based on the results of [7], other researchers developed offline ADP approaches for more complicated situations, such as the optimal tracking problem [8], [14], [21], optimal control with control constraints [10], optimal control with time delays [11], [12], and optimal control with finite approximation errors [15]. However, all of the above works require knowledge of the system dynamics and use offline tuning laws.

Since mathematical models of real-world system dynamics are often difficult to build, designing optimal controllers for nonlinear systems with unknown dynamics has become one of the main foci of control practitioners. The work of [9] analyzed the convergence for unknown DT nonlinear systems using offline-trained neural networks, but this method introduced into the tuning law the Lebesgue integral [7], which requires data from a subset of the plant, and thus spent considerable time on offline training. In [20], the authors developed a way to control unknown DT nonlinear systems using globalized dual heuristic programming, and others employed the single network dual heuristic dynamic programming (SN-DHP) technique in the ADP algorithm in [19]. Both introduced gradient-based adaptation tuning laws instead of the approach in [9]. However, without using recorded system data, iterations were needed in the tuning laws of [19], [20], so the critic NN and actor NN could not be updated with respect to time at each sampling interval. Moreover, although [9], [20], [21] constructed a NN to identify the unknown system dynamics, they assumed that the NN identification error approaches zero, and thus the effects of the estimation error on the convergence of the actor–critic algorithms were not considered.

On the other hand, online adaptive-optimal controller designs were presented in [17], [18], [22], [23], [24] to overcome the iterative offline training methodology. The central theme of the approaches in [23], [24], as well as several works in [22], is that the cost function and optimal control signal are approximated by online parametric structures, such as NNs. Although the proposed methods in [22], [23], [24] are verified via numerical simulations, the approximation errors are not considered and proofs of convergence are not demonstrated. The work of [17] presented a novel approach that relies on current and recorded system data for adaptation and proved convergence while considering the approximation errors; recently, the authors in [18] improved this method in the presence of unknown internal dynamics and called it the time-based ADP algorithm. However, because its tuning law requires knowledge of the control coefficient matrix, the general time-based ADP algorithm [17], [18] becomes invalid when dealing with unknown DT nonlinear systems. Meanwhile, most online adaptive optimal control algorithms with ADP require a persistence of excitation (PE) condition [25], [26], [27], which is important in NN identification. Refs. [17], [18] proposed a similar condition to ensure the PE requirement, but they did not give the lower bound in the proof.

The contributions of this paper lie in the development of an online adaptive learning algorithm to solve an infinite horizon optimal control problem for unknown DT nonlinear systems. Although the general time-based ADP technique requires knowledge of the system dynamics, by performing an identification process the time-based ADP algorithm, which makes use of current and recorded system data, becomes applicable to the optimal control problem of unknown nonlinear systems. By using current and recorded system information, the PE condition is ensured by a new criterion with an explicit lower bound, and the unknown nonlinear DT system can be controlled once per sampling instant. Convergence of the system states and the NN implementation is demonstrated while explicitly considering all the NN reconstruction errors, in contrast to previous works [9], [20], [21].
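The role of recorded data in meeting the PE requirement can be illustrated with a small sketch. The check below is a hypothetical stand-in for the paper's criterion: it tests whether the matrix stacking the current and recorded regressor vectors keeps its smallest singular value above an explicit lower bound (the threshold and regressor choice here are illustrative assumptions, not the paper's exact condition):

```python
import numpy as np

def pe_satisfied(regressors, lower_bound=1e-3):
    """Hypothetical richness check standing in for the paper's PE criterion:
    the stacked current + recorded regressors must keep their smallest
    singular value above an explicit lower bound."""
    M = np.vstack(regressors)
    return bool(np.linalg.svd(M, compute_uv=False)[-1] >= lower_bound)

# A full-rank history stack passes; a rank-deficient one (repeated data) fails.
rich = pe_satisfied([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
poor = pe_satisfied([np.array([1.0, 1.0]), np.array([1.0, 1.0])])
```

Reusing recorded samples in this way lets the condition be verified online instead of relying on an unverifiable excitation assumption on future inputs.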


Background

Consider the affine DT nonlinear system described by
$$x_{k+1}=f(x_k)+g(x_k)u(x_k)$$
where $x_k\in\mathbb{R}^{n}$, $f(x_k)\in\mathbb{R}^{n}$, $g(x_k)\in\mathbb{R}^{n\times m}$ and $u(x_k)\in\mathbb{R}^{m}$. Without loss of generality, assume that the system is controllable, sufficiently smooth, drift free, and that $x=0$ is a unique equilibrium point on a compact set $\Omega$, while the states are considered measurable. In the following, $u(x_k)$ is denoted by $u_k$ for simplicity.

Define the infinite horizon cost function
$$J(x_k)=\sum_{n=k}^{\infty}\left[Q(x_n)+u_n^{T}Ru_n\right]=Q(x_k)+u_k^{T}Ru_k+J(x_{k+1})=\rho(x_k,u_k)+J(x_{k+1})$$
where $Q(x_k)$ is a positive definite penalty on the states and $R$ is a positive definite weighting matrix.
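The optimal cost and control used in the later sections follow from this cost in the standard way; a sketch of the derivation, consistent with the affine dynamics and quadratic control penalty above (the equation numbers (3) and (4) cited later presumably refer to expressions of this form), is:

```latex
% Bellman optimality (DTHJB) for the cost J(x_k):
J^{*}(x_k) = \min_{u_k}\left[ Q(x_k) + u_k^{T} R u_k + J^{*}(x_{k+1}) \right].
% Setting the gradient with respect to u_k to zero, with
% x_{k+1} = f(x_k) + g(x_k) u_k, yields the stationarity condition
2 R u_k + g(x_k)^{T}\,\frac{\partial J^{*}(x_{k+1})}{\partial x_{k+1}} = 0
\;\Longrightarrow\;
u^{*}(x_k) = -\tfrac{1}{2}\, R^{-1} g(x_k)^{T}\,
\frac{\partial J^{*}(x_{k+1})}{\partial x_{k+1}}.
```

Note that the stationarity condition involves $g(x_k)$ explicitly, which is why an unknown input matrix obstructs the general time-based ADP tuning law.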

NN identification of the unknown nonlinear system

To begin the NN identifier construction, the system dynamics (1) are rewritten as
$$x_{k+1}=f(x_k)+g(x_k)u_k=H(x_k,u_k).$$
The function $H(x_k,u_k)$ has a NN representation on a compact set $S$ according to the universal approximation property of NNs, which can be written as
$$H(x_k,u_k)=W_s^{T}\theta(Y_s^{T}z_s(k))+\varepsilon_{sk}=W_s^{T}\theta(\bar{z}_s(k))+\varepsilon_{sk}$$
where $W_s\in\mathbb{R}^{l\times n}$ and $Y_s\in\mathbb{R}^{(n+m)\times l}$ are the constant ideal weight matrices, $l$ is the number of hidden layer neurons, $\theta(\cdot)$ is the NN activation function, $z_s(k)=[x_k^{T}\;u_k^{T}]^{T}$ is the NN input, and $\bar{z}_s(k)=Y_s^{T}z_s(k)$.
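As a concrete illustration, an identifier with this structure (fixed input weights $Y_s$, tuned output weights $W_s$) can be sketched as follows; the learning rate, activation, toy plant, and normalized-gradient update are illustrative assumptions, not the paper's exact tuning law:

```python
import numpy as np

class NNIdentifier:
    """One-hidden-layer identifier for x_{k+1} = H(x_k, u_k).
    Y_s (input-to-hidden) is fixed at random values; W_s (hidden-to-output)
    is tuned by a normalized gradient step on the identification error."""

    def __init__(self, n, m, l, lr=0.3, seed=0):
        rng = np.random.default_rng(seed)
        self.Ys = rng.normal(scale=0.5, size=(n + m, l))
        self.Ws = np.zeros((l, n))
        self.lr = lr

    def predict(self, x, u):
        z = np.concatenate([x, u])            # z_s(k) = [x_k^T  u_k^T]^T
        theta = np.tanh(self.Ys.T @ z)        # hidden activation theta(Ys^T z)
        return self.Ws.T @ theta, theta

    def update(self, x, u, x_next):
        x_hat, theta = self.predict(x, u)
        e = x_next - x_hat                    # identification error
        self.Ws += self.lr * np.outer(theta, e) / (1.0 + theta @ theta)
        return float(np.linalg.norm(e))

# Identify a toy 2-state, 1-input plant (not the paper's example) online.
plant = lambda x, u: np.array([0.8 * x[1], 0.5 * np.sin(x[0]) + u[0]])
ident = NNIdentifier(n=2, m=1, l=8)
x, errors = np.array([0.5, -0.3]), []
for k in range(3000):
    u = np.array([0.3 * np.sin(0.1 * k)])
    x_next = plant(x, u)
    errors.append(ident.update(x, u, x_next))
    x = x_next
```

With a persistently exciting probe input the recorded error sequence shrinks toward the residual set by the reconstruction error $\varepsilon_{sk}$, which is why the paper's analysis keeps that error term explicit rather than assuming it vanishes.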

Optimal control of unknown nonlinear DT systems

From the background section, it can be seen that the essential part of solving the DTHJB equation is to compute (3), (4) iteratively. However, since the system dynamics are unknown, the general time-based ADP algorithm [17], [18] cannot be implemented to solve the above two equations. To circumvent this deficiency, the general optimal control (4) can be substituted by (15), based on the NN identification of the unknown nonlinear system. Then we propose a novel time-based ADP algorithm
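A minimal sketch of this substitution, using a quadratic critic basis and a hypothetical identified input matrix `g_hat` standing in for the unknown $g(x_k)$ (the basis, the names, and the evaluation point are simplifying assumptions, not the paper's equation (15)):

```python
import numpy as np

def phi_grad(x):
    # Jacobian of the quadratic critic basis phi(x) = [x1^2, x1*x2, x2^2]
    return np.array([[2 * x[0], 0.0],
                     [x[1],     x[0]],
                     [0.0,      2 * x[1]]])

def control_from_identified_model(x, Wc, g_hat, R):
    """Substitute the identified input matrix g_hat(x) for the unknown g(x)
    in the optimal-control expression u = -1/2 R^{-1} g(x)^T dJ/dx.
    For simplicity the critic gradient is evaluated at x rather than at the
    successor state -- a one-step illustrative shortcut."""
    dJ = phi_grad(x).T @ Wc                    # dJ_hat/dx for J_hat = Wc^T phi(x)
    return -0.5 * np.linalg.solve(R, g_hat(x).T @ dJ)

# With J_hat = x1^2 + x2^2 (Wc = [1, 0, 1]), g_hat = [0, 1]^T and R = 1,
# the resulting control is pure damping on x2: u = -x2.
Wc = np.array([1.0, 0.0, 1.0])
g_hat = lambda x: np.array([[0.0], [1.0]])
u = control_from_identified_model(np.array([1.0, 2.0]), Wc, g_hat, np.array([[1.0]]))
```

The key point is that only the identifier's estimate of the input matrix enters the control computation, so no knowledge of the true dynamics is needed.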

Simulations

This section presents a simulation example to illustrate the effectiveness of the proposed optimal adaptive control algorithm. Consider a second-order nonlinear discrete-time system
$$\begin{bmatrix}x_1(k+1)\\x_2(k+1)\end{bmatrix}=\begin{bmatrix}\sin(0.5x_2(k))\\\cos(1.4x_2(k))\sin(0.9x_1(k))\end{bmatrix}+\begin{bmatrix}0\\1\end{bmatrix}u(k)$$
where the system dynamics are unknown. The quadratic cost function weights are chosen as $Q(x)=x^{T}x$ and $R=1$.
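The example plant and cost can be encoded directly; the open-loop rollout below is only a probe to exercise the plant (the initial state and zero input are arbitrary assumptions, not the paper's learned controller):

```python
import numpy as np

# The simulation example's plant: a second-order nonlinear DT system whose
# dynamics are unknown to the controller, with cost Q(x) = x^T x and R = 1.
def plant(x, u):
    return np.array([np.sin(0.5 * x[1]),
                     np.cos(1.4 * x[1]) * np.sin(0.9 * x[0]) + u])

def stage_cost(x, u, R=1.0):
    return float(x @ x + R * u * u)

# Open-loop rollout from x0 = [1, -1] with u = 0, accumulating the cost.
x, total = np.array([1.0, -1.0]), 0.0
for k in range(50):
    total += stage_cost(x, 0.0)
    x = plant(x, 0.0)
```

Note that the origin is an equilibrium of the uncontrolled plant, consistent with the standing assumption that $x=0$ is the unique equilibrium on the compact set.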

We choose three-layer feedforward NNs as the model neural network, critic neural network, and action neural network, with the structures 3–8–2, 2–8–1, and 2–8–1, respectively.

Conclusions

A new time-based ADP algorithm for unknown DT nonlinear systems, which is independent of the knowledge of the system dynamics, has been proposed in this paper. By using current and recorded system information, the PE requirement has been ensured by a new assumption with an explicit lower bound, and the unknown nonlinear DT system could be controlled once per sampling instant. By considering an appropriate Lyapunov function, we have proven UUB of the overall closed-loop system under the effects of the NN approximation errors.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (61034005, 61433004), the National High Technology Research and Development Program of China (2012AA040104), and IAPI Fundamental Research Funds 2013ZCX14. This work was also supported by the development project of Key Laboratory of Liaoning Province.


References (27)

  • F.L. Lewis et al., Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers, IEEE Control Syst. (2012)
  • P.J. Werbos, A menu of designs for reinforcement learning over time, Neural Networks for Control, 1990, pp....
  • P.J. Werbos, Approximate dynamic programming for real-time control and neural modeling, Handbook of Intelligent...

    Geyang Xiao received the B.S. degree in Automation Control from Northeastern University, Shenyang, China, in 2012. He has been pursuing the Ph.D. degree with Northeastern University, Shenyang, China, since 2012. His current research interests include neural networks-based controls, non-linear controls, adaptive dynamic programming, and their industrial applications.

    Huaguang Zhang received the B.S. degree and the M.S. degree in control engineering from Northeast Dianli University of China, Jilin City, China, in 1982 and 1985, respectively. He received the Ph.D. degree in thermal power engineering and automation from Southeast University, Nanjing, China, in 1991. He joined the Department of Automatic Control, Northeastern University, Shenyang, China, in 1992, as a Postdoctoral Fellow for two years. Since 1994, he has been a Professor and Head of the Institute of Electric Automation, School of Information Science and Engineering, Northeastern University, Shenyang, China. His main research interests are fuzzy control, stochastic system control, neural networks based control, nonlinear control, and their applications. He has authored and coauthored over 200 journal and conference papers, four monographs and co-invented 20 patents.

    Yanhong Luo received the B.S. degree in automation control, M.S. degree and Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2003, 2006 and 2009, respectively. She is currently working in Northeastern University as an Associate Professor. Her research interests include approximate dynamic programming, neural networks adaptive control, fuzzy control and their industrial application.
