Brief paper: Policy iteration based feedback control☆
Introduction
Optimal control of stochastic dynamic systems is a difficult problem. Because exact solutions can only be obtained under some rather strict restrictions on system structures, numerical and approximate approaches have to be developed. Dynamic programming (Bellman, 1957) is one of the commonly used approaches to the problem, which solves, analytically or numerically, the well-known optimality equation called the Bellman equation.
The research in this paper was motivated by results in two areas: optimal control of stochastic dynamic systems and Markov decision processes (MDPs). It has long been recognized that stochastic control systems can be viewed as Markov decision processes. For example, numerical algorithms based on value iteration (Kushner & Paul, 1992; Puterman, 1994; Tsitsiklis & Van Roy, 1996) have been developed for solving the Bellman equation (Kushner, 1977). Hernandez-Lerma and Lasserre (1996) deal with general state MDPs and provide conditions for the existence of stationary optimal policies. However, with value iteration and dynamic programming, the transition probabilities, or equivalently the system structure and parameters, have to be known. When the system structure and/or parameters are unknown, identification methods have to be used, which further complicates the problem. Therefore, numerical methods and learning based approaches have to be developed.
In this paper, we consider average cost Markov decision problems. We propose a policy iteration based approach to the optimal control problem. At each iteration, we analyze the system's behavior under one policy and find another policy under which the system performs better. The central concept of this approach is the performance potential (Cao, 2000, Cao, 2003), also known as the bias (Puterman, 1994) or the relative cost (Bertsekas, 1995). When the system structure and parameters are known, the potentials can be obtained by solving the Poisson equation; otherwise, they can be estimated from a sample path (Cao & Wan, 1998). Compared with value iteration and dynamic programming, the policy iteration based approach can be implemented on-line without knowing all the system parameters, and learning based implementation algorithms can be developed. The approach treats nonlinear systems in the same way as linear systems.
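As a concrete illustration of this evaluation-improvement cycle, the following sketch runs potential-based policy iteration on a small finite-state, average-cost MDP. The states, costs, and transition matrices are invented purely for demonstration (the paper's own setting is the continuous-state case), and all function names are ours:

```python
import numpy as np

def solve_potentials(P, f):
    """Evaluate one policy: solve the Poisson equation
    (I - P) g = f - eta*1 for the average cost eta and
    potentials g (normalized so that pi @ g = 0)."""
    n = P.shape[0]
    # Stationary distribution: pi P = pi, pi 1 = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    eta = pi @ f
    # (I - P + e pi^T) is invertible for an ergodic chain.
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), f - eta)
    return eta, g

def policy_iteration(P_a, f_a, n_states, n_actions, max_iter=50):
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        # Evaluation step under the current policy.
        P = np.array([P_a[policy[s]][s] for s in range(n_states)])
        f = np.array([f_a[policy[s]][s] for s in range(n_states)])
        eta, g = solve_potentials(P, f)
        # Improvement step: pick the action minimizing f_a + P_a g.
        new_policy = np.argmin(
            [[f_a[a][s] + P_a[a][s] @ g for a in range(n_actions)]
             for s in range(n_states)], axis=1)
        if np.array_equal(new_policy, policy):
            return policy, eta
        policy = new_policy
    return policy, eta

# Toy data: 3 states, 2 actions (entirely made up).
P_a = [np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]),
       np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])]
f_a = [np.array([2.0, 1.0, 3.0]), np.array([1.5, 2.5, 1.0])]
policy, eta = policy_iteration(P_a, f_a, 3, 2)
print(policy, eta)
```

In the learning-based variant advocated by the paper, the evaluation step would estimate the potentials from a sample path instead of solving the Poisson equation directly.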
In Section 2, we describe how a (nonlinear) optimal control problem can be modelled as a Markov decision problem. In Section 3, we develop the policy iteration theory for MDPs with continuous state spaces. In Section 4, we apply the policy iteration approach to the jump linear quadratic (JLQ) problem. We obtain the closed form of the potentials for the problem and show that the optimal solution can be obtained via the coupled Riccati equations. Section 5 concludes the paper with a discussion.
The contributions of this paper are to define the performance potential for continuous state spaces, to propose the policy iteration approach for solving optimal control problems, and to apply it to the JLQ problem. The policy iteration based approach can be implemented on-line, and learning algorithms can be developed.
Section snippets
Control systems modelled as MDPs
Consider a stochastic control system of the form x_{l+1} = f(x_l, u_l, ξ_l), where l = 0, 1, 2, … denotes the discrete time, x_l is an n-dimensional vector representing the system state at time l, ξ_l is the random noise at time l, and u_l ∈ U is an m-dimensional vector representing the control applied to the system at time l, with U being a specified control constraint set of R^m. We assume that ξ_l, l = 0, 1, 2, …, is a sequence of independent and identically distributed (i.i.d.) random variables
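The Markovian character of such a system is easy to see in simulation: under a stationary feedback law, the distribution of the next state depends only on the current state. A minimal sketch, with the plant dynamics and feedback gain entirely invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, xi):
    """A made-up nonlinear plant x_{l+1} = f(x_l, u_l, xi_l)."""
    return 0.8 * x + u + 0.1 * np.sin(x) + xi

def mu(x):
    """A simple hypothetical stationary feedback policy u_l = mu(x_l)."""
    return -0.5 * x

x = np.zeros(1)
traj = [x.copy()]
for l in range(100):
    xi = rng.normal(scale=0.1, size=1)  # i.i.d. noise sequence
    x = f(x, mu(x), xi)                 # Markovian transition
    traj.append(x.copy())
```

Because the noise is i.i.d., the closed-loop state process {x_l} is a Markov chain, which is what allows the MDP machinery of the next section to apply.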
MDPs with continuous state spaces
As shown in Section 2, we need to extend the theory of policy iteration to continuous state spaces. The transition probability with a continuous state space is described by an operator (integration) on the function space. We will present the main ideas without delving into operator theory; in particular, we will not study the general conditions under which the infinite dimensional operators can be interchanged in order. There are standard theorems for such interchangeability (e.g., see
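For orientation, the Poisson equation defining the potentials takes the following operator form on a continuous state space (the notation here is assumed: P^u is the transition operator of policy u, f^u its one-step cost, η^u its average cost, g^u its potential, and e the constant-one function):

```latex
(I - P^u)\, g^u = f^u - \eta^u e,
\qquad \text{i.e.} \qquad
g^u(x) - \int_{\mathbb{R}^n} g^u(y)\, P^u(\mathrm{d}y \mid x)
  = f^u(x) - \eta^u .
```

On a finite state space the integral reduces to a matrix-vector product, recovering the familiar Poisson equation of average-cost MDPs.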
JLQ problem
In this section, we derive the performance potentials for the JLQ problem. We show that with the approach developed in Section 3, we can directly obtain the optimal feedback and coupled Riccati equation, which are usually obtained by dynamic programming.
In a discrete time JLQ problem, we consider a two-level stochastic control system. The system state at time l is denoted as (x_l, θ_l), where θ_l represents the mode (high level) that the system is in, and x_l denotes the
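The coupled Riccati equations for such jump linear quadratic problems can be solved by fixed-point iteration. The sketch below uses the standard discrete-time coupled Riccati recursion for Markov jump linear systems; the two-mode matrices A, B, Q, R and the mode transition matrix p are invented for illustration and are not from the paper:

```python
import numpy as np

def coupled_riccati(A, B, Q, R, p, iters=500):
    """Fixed-point iteration on the coupled Riccati equations
    K_i = Q_i + A_i' E_i(K) A_i
          - A_i' E_i(K) B_i (R_i + B_i' E_i(K) B_i)^{-1} B_i' E_i(K) A_i,
    where E_i(K) = sum_j p[i, j] K_j averages over the next mode."""
    m = len(A)
    K = [np.zeros_like(Q[i]) for i in range(m)]
    for _ in range(iters):
        EK = [sum(p[i, j] * K[j] for j in range(m)) for i in range(m)]
        K = [Q[i]
             + A[i].T @ EK[i] @ A[i]
             - A[i].T @ EK[i] @ B[i]
               @ np.linalg.solve(R[i] + B[i].T @ EK[i] @ B[i],
                                 B[i].T @ EK[i] @ A[i])
             for i in range(m)]
    # Mode-dependent feedback gains: u_l = -F_{theta_l} x_l.
    EK = [sum(p[i, j] * K[j] for j in range(m)) for i in range(m)]
    F = [np.linalg.solve(R[i] + B[i].T @ EK[i] @ B[i],
                         B[i].T @ EK[i] @ A[i]) for i in range(m)]
    return K, F

# Toy two-mode scalar example (made up).
A = [np.array([[1.1]]), np.array([[0.9]])]
B = [np.array([[1.0]]), np.array([[1.0]])]
Q = [np.eye(1), np.eye(1)]
R = [np.eye(1), np.eye(1)]
p = np.array([[0.9, 0.1], [0.3, 0.7]])
K, F = coupled_riccati(A, B, Q, R, p)
```

The point of Section 4 is that the same solution emerges from the potential-based policy iteration of Section 3, rather than from dynamic programming.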
Discussion and conclusion
In this paper, we apply the potential based policy iteration approach to MDPs with continuous state spaces to solve optimal control problems. We derive the potentials for the JLQ problem and show that the solution to this problem can be obtained via coupled Riccati equations.
One of the main advantages of the policy iteration based approach is that it can be implemented on-line and learning based algorithms can be developed when the system structure and parameters are unknown. In addition, this
Acknowledgments
The authors would like to express their gratitude to the four anonymous reviewers for their comments, which helped to improve this paper.
References (14)
- Bellman, R. (1957). Dynamic programming.
- Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vol. II). Belmont, MA: Athena...
- Cao, X.-R. (2000). A unified approach to Markov decision problems and performance sensitivity analysis. Automatica.
- Cao, X.-R. (2003). From perturbation analysis to Markov decision processes and reinforcement learning. Discrete Event Dynamic Systems: Theory and Applications.
- Cao, X.-R., & Wan, Y.-W. (1998). Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization. IEEE Transactions on Control Systems Technology.
- Costa, O. L. V., et al. (1995). Discrete-time LQ-optimal control problems for infinite Markov jump parameter systems. IEEE Transactions on Automatic Control.
- Costa, O. L. V., et al. (2005). Discrete-time Markov jump linear systems.
Kan-Jian Zhang received the B.S. degree in mathematics from Nankai University, China in 1994, and the M.S. and Ph.D. degrees in control theory and control engineering from Southeast University, China in 1997 and 2000. He is currently an associate professor in Research Institute of Automation, Southeast University. His research is in nonlinear control theory and its applications, with particular interest in robust output feedback design and optimization control.
Yan-Kai Xu received his B.E. degree in automatic control in 2003 from Tsinghua University, Beijing, China. He is currently a Ph.D. candidate in the Center for Intelligent and Networked Systems (CFINS), Department of Automation, Tsinghua University. His research interests include optimization and control of stochastic systems, discrete event dynamic systems, and machine learning.
Xi Chen received her B.Sc. and M.Eng. from Nankai University, Tianjin, China, in 1986 and 1989, respectively. After graduation, she worked in the Software Engineering Institute at Beijing University of Aeronautics and Astronautics for seven years. From October 1996 she studied in the Chinese University of Hong Kong and received her Ph.D. in 2000. Then she worked as a post-doctoral fellow in Information Communication Institute of Singapore and in the Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong. Since July 2003, she works in the Center for Intelligent and Networked Systems (CFINS), Department of Automation, Tsinghua University, Beijing, China. Her research interests include wireless sensor networks and stochastic control.
Xi-Ren Cao received the M.S. and Ph.D. degrees from Harvard University in 1981 and 1984, respectively, where he was a research fellow from 1984 to 1986. He then worked as a principal and consultant engineer/engineering manager at Digital Equipment Corporation, USA, until October 1993. Since then, he has been a professor at the Hong Kong University of Science and Technology (HKUST), Hong Kong, China. He is the director of the Center for Networking at HKUST. He has held visiting positions at Harvard University, University of Massachusetts at Amherst, AT&T Labs, University of Maryland at College Park, University of Notre Dame, Tsinghua University, University of Science and Technology of China, and other universities.
Dr. Cao owns three patents in data and tele-communications and has published two books: Realization Probabilities: The Dynamics of Queueing Systems, Springer Verlag, 1994, and Perturbation Analysis of Discrete-Event Dynamic Systems, Kluwer Academic Publishers, 1991 (co-authored with Y. C. Ho). He received the Outstanding Transactions Paper Award from the IEEE Control Systems Society in 1987 and the Outstanding Publication Award from the Institute of Management Science in 1990. He is a fellow of IEEE, Chairman of the IEEE Fellow Evaluation Committee of the IEEE Control Systems Society, Editor-in-Chief of Discrete Event Dynamic Systems: Theory and Applications, Associate Editor at Large of IEEE Transactions on Automatic Control, member of the Board of Governors of the IEEE Control Systems Society, member of the IFAC Technical Board, and chairman of the IFAC Coordinating Committee on Systems and Signals. He has served as associate editor of a number of international journals and chairman of several technical committees of international professional societies. His current research areas include discrete event dynamic systems, stochastic learning and optimization theory, performance analysis of communication systems, and signal processing.
☆ This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Bart De Schutter under the direction of Editor Ian Petersen.
1. Partially supported by the National Natural Science Foundation of China (60404006).
2. Partially supported by the National Natural Science Foundation of China (60574064).
3. Supported in part by a grant from Hong Kong UGC.