Automatica

Volume 44, Issue 4, April 2008, Pages 1055-1061

Brief paper
Policy iteration based feedback control

https://doi.org/10.1016/j.automatica.2007.08.014

Abstract

It is well known that stochastic control systems can be viewed as Markov decision processes (MDPs) with continuous state spaces. In this paper, we propose to apply the policy iteration approach of MDPs to the optimal control problem of stochastic systems. We first provide an optimality equation based on performance potentials and develop a policy iteration procedure. Then we apply policy iteration to the jump linear quadratic problem and obtain the coupled Riccati equations for its optimal solution. The approach is applicable to linear as well as nonlinear systems and can be implemented on-line on real-world systems without identifying all of the system structure and parameters.

Introduction

Optimal control of stochastic dynamic systems is a difficult problem. Because exact solutions can only be obtained under some rather strict restrictions on system structures, numerical and approximate approaches have to be developed. Dynamic programming (Bellman, 1957) is one of the commonly used approaches to the problem, which solves, analytically or numerically, the well-known optimality equation called the Bellman equation.

The research in this paper was motivated by the results in two areas: optimal control of stochastic dynamic systems and Markov decision processes (MDPs). It has long been realized that stochastic control systems can be viewed as Markov decision processes. For example, numerical algorithms based on value iteration (Kushner & Paul, 1992; Puterman, 1994; Tsitsiklis & Van Roy, 1996) have been developed for solving the Bellman equation (Kushner, 1977). Hernandez-Lerma and Lasserre (1996) deal with general-state-space MDPs and provide conditions for the existence of stationary optimal policies. However, with value iteration and dynamic programming, the transition probabilities, or equivalently the system structure and parameters, have to be known. When the system structure and/or parameters are unknown, identification methods have to be used, which further complicates the problem. Therefore, numerical methods and learning-based approaches have to be developed.

In this paper, we consider average-cost Markov decision problems. We propose a policy iteration based approach to the optimal control problem. At each iteration, we analyze the system's behavior under one policy and find another policy under which the system performs better. The main concept of this approach is the performance potential (Cao, 2000; Cao, 2003), also known as the bias (Puterman, 1994) or the relative cost (Bertsekas, 1995). When the system structure and parameters are known, the potentials can be obtained by solving the Poisson equation; otherwise, they can be estimated from a sample path (Cao & Wan, 1998). Compared with value iteration and dynamic programming, the policy iteration based approach can be implemented on-line without knowing all the system parameters, and learning based implementation algorithms can be developed. The approach treats nonlinear systems in the same way as linear systems.
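For concreteness, in the finite-state ergodic case these relations take the following standard form (a sketch of the familiar finite-state results, cf. Cao (2000), not the continuous-state equations developed in this paper), with $P$ and $f$ the transition matrix and cost vector under a policy, $\eta$ its average cost, $\pi$ its stationary distribution, and $g$ the potential vector:
\[
(I - P)\,g + \eta e = f, \qquad e = (1,\ldots,1)^{\top} \quad \text{(Poisson equation)},
\]
\[
\eta' - \eta = \pi'\big[(f' + P'g) - (f + Pg)\big].
\]
Thus any policy whose $f' + P'g$ is componentwise no larger than $f + Pg$ performs no worse, which is the basis of the policy improvement step.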

In Section 2, we describe how a (nonlinear) optimal control problem can be modelled as a Markov decision problem. In Section 3, we develop the policy iteration theory for MDPs with continuous state spaces. In Section 4, we apply the policy iteration approach to the jump linear quadratic (JLQ) problem. We obtain the closed form of the potentials for this problem and show that the optimal solution can be obtained via the coupled Riccati equations. Section 5 concludes the paper with a discussion.

The contributions of this paper are to define the performance potential for continuous state spaces, to propose a policy iteration approach for solving optimal control problems, and to apply it to the JLQ problem. The policy iteration based approach can be implemented on-line, and learning algorithms can be developed.

Control systems modelled as MDPs

Consider a stochastic control system of the form
\[
X_{l+1} = H(X_l, u_l) + \xi_l, \qquad l = 0, 1, \ldots,
\]
where $l = 0, 1, \ldots$ denotes the discrete time, $X_l \in \mathbb{R}^n$, with $\mathbb{R} = (-\infty, +\infty)$, is an $n$-dimensional vector representing the system state at time $l$, $\xi_l \in \mathbb{R}^n$ is the random noise at time $l$, and $u_l \in \mathcal{U}$ is an $m$-dimensional vector representing the control applied to the system at time $l$, with $\mathcal{U}$ being a specified control constraint set in $\mathbb{R}^m$. We assume that $\xi_l$, $l = 0, 1, \ldots$, is a sequence of independent and identically distributed (i.i.d.) random variables
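For illustration, the sketch below simulates this model under a stationary feedback policy $u_l = \phi(X_l)$; the particular dynamics $H$, feedback law, and noise distribution used in the example are placeholder assumptions, not taken from the paper.

```python
import numpy as np

def simulate(H, policy, noise_sampler, x0, num_steps, rng):
    """Simulate X_{l+1} = H(X_l, u_l) + xi_l under a feedback policy u_l = policy(X_l).

    H             : callable (x, u) -> deterministic part of the next state
    policy        : callable x -> u (stationary feedback law)
    noise_sampler : callable rng -> xi (one i.i.d. noise sample)
    """
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(num_steps):
        u = policy(x)
        x = H(x, u) + noise_sampler(rng)
        trajectory.append(x.copy())
    return np.array(trajectory)

# Hypothetical example: scalar linear dynamics with Gaussian noise.
rng = np.random.default_rng(0)
H = lambda x, u: 0.9 * x + u                  # placeholder dynamics
policy = lambda x: -0.5 * x                   # placeholder linear feedback
noise = lambda r: r.normal(0.0, 0.1, size=1)  # i.i.d. N(0, 0.01) noise
traj = simulate(H, policy, noise, x0=[1.0], num_steps=100, rng=rng)
```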

MDPs with continuous state spaces

As shown in Section 2, we need to extend the theory of policy iteration to continuous state spaces. The transition probability with a continuous state space is described by an operator (integration) on the function space. We will present the main ideas and will not go deeply into operator theory; in particular, we will not study the general conditions for the infinite-dimensional operators to be interchangeable in their orders. There are standard theorems for such interchangeability (e.g., see
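As a finite-state analogue of the procedure developed in this section (an illustrative sketch for ergodic finite MDPs, not the paper's continuous-state algorithm), potential-based policy iteration alternates between solving the Poisson equation for the current policy and improving the policy state by state:

```python
import numpy as np

def evaluate_policy(P, f):
    """Solve the Poisson equation (I - P) g + eta * e = f for an ergodic chain.

    Normalization g[0] = 0.  Returns (eta, g).
    """
    n = P.shape[0]
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[:n, :n] = np.eye(n) - P
    A[:n, n] = 1.0          # coefficient of eta in each row
    b[:n] = f
    A[n, 0] = 1.0           # normalization g[0] = 0
    sol = np.linalg.solve(A, b)
    return sol[n], sol[:n]

def policy_iteration(P_ua, f_ua, max_iter=100):
    """Potential-based policy iteration for a finite average-cost MDP.

    P_ua[a] is the transition matrix under action a; f_ua[a] the cost vector.
    """
    num_actions, n = len(P_ua), P_ua[0].shape[0]
    d = np.zeros(n, dtype=int)                      # initial policy
    for _ in range(max_iter):
        P = np.array([P_ua[d[s]][s] for s in range(n)])
        f = np.array([f_ua[d[s]][s] for s in range(n)])
        eta, g = evaluate_policy(P, f)
        # Improvement: minimize f(s, a) + sum_y P(y | s, a) g(y) in each state.
        q = np.array([[f_ua[a][s] + P_ua[a][s] @ g for a in range(num_actions)]
                      for s in range(n)])
        d_new = q.argmin(axis=1)
        if np.array_equal(d_new, d):
            break
        d = d_new
    return d, eta, g
```

In the continuous-state setting of this section, the transition matrices are replaced by transition operators (integrals over the state space), and the potentials can be estimated from a sample path when the transition law is unknown.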

JLQ problem

In this section, we derive the performance potentials for the JLQ problem. We show that with the approach developed in Section 3, we can directly obtain the optimal feedback and the coupled Riccati equations, which are usually obtained by dynamic programming.

In a discrete-time JLQ problem, we consider a two-level stochastic control system. The system state at time $l$, $l = 0, 1, \ldots$, is denoted as $(M_l, X_l)$, where $M_l \in \mathcal{M} = \{1, 2, \ldots, M\}$ represents the mode (high level) that the system is in, and $X_l \in \mathbb{R}^n$ denotes the
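For the linear mode-dependent dynamics and quadratic costs of the JLQ problem, one standard form of the coupled Riccati equations and of the mode-dependent feedback gains (cf. Costa et al., 1995) can be computed by the fixed-point iteration sketched below; the per-mode matrices $A_i, B_i, Q_i, R_i$ and the mode transition matrix are user-supplied assumptions in this illustration, not quantities from the paper.

```python
import numpy as np

def coupled_riccati(A, B, Q, R, Pm, num_iter=500, tol=1e-10):
    """Fixed-point iteration for the coupled Riccati equations of a
    discrete-time Markov jump LQ problem (one standard form, cf. Costa et al.).

    A, B, Q, R : lists of per-mode system and cost matrices (A_i, B_i, Q_i, R_i)
    Pm         : mode transition matrix, Pm[i, j] = Prob(M_{l+1} = j | M_l = i)
    Returns the solutions P_i and mode-dependent gains K_i (u_l = -K_{M_l} X_l).
    """
    M, n = len(A), A[0].shape[0]
    P = [np.zeros((n, n)) for _ in range(M)]
    for _ in range(num_iter):
        # Mode-coupled term E_i(P) = sum_j Pm[i, j] * P_j.
        E = [sum(Pm[i, j] * P[j] for j in range(M)) for i in range(M)]
        P_new = [Q[i] + A[i].T @ E[i] @ A[i]
                 - A[i].T @ E[i] @ B[i]
                 @ np.linalg.solve(R[i] + B[i].T @ E[i] @ B[i],
                                   B[i].T @ E[i] @ A[i])
                 for i in range(M)]
        if max(np.max(np.abs(Pn - Po)) for Pn, Po in zip(P_new, P)) < tol:
            P = P_new
            break
        P = P_new
    E = [sum(Pm[i, j] * P[j] for j in range(M)) for i in range(M)]
    K = [np.linalg.solve(R[i] + B[i].T @ E[i] @ B[i], B[i].T @ E[i] @ A[i])
         for i in range(M)]
    return P, K
```

Under standard stabilizability-type conditions for Markov jump linear systems, the iteration converges and the optimal control is the mode-dependent linear feedback $u_l = -K_{M_l} X_l$.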

Discussion and conclusion

In this paper, we apply the potential-based policy iteration approach to MDPs with continuous state spaces to solve optimal control problems. We derive the potentials for the JLQ problem and show that the solution to this problem can be obtained via coupled Riccati equations.

One of the main advantages of the policy iteration based approach is that it can be implemented on-line and learning based algorithms can be developed when the system structure and parameters are unknown. In addition, this

Acknowledgments

The authors would like to express their gratitude to the four anonymous reviewers for their comments, which helped to revise this paper.

References

  • Bellman, R. E. (1957). Dynamic programming.
  • Bertsekas, D. P. (1995). Dynamic programming and optimal control (Vol. II). Belmont, MA: Athena Scientific.
  • Cao, X. R. (2000). A unified approach to Markov decision problems and performance sensitivity analysis. Automatica.
  • Cao, X. R. (2003). From perturbation analysis to Markov decision processes and reinforcement learning. Discrete Event Dynamic Systems: Theory and Applications.
  • Cao, X. R., & Wan, Y. W. (1998). Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization. IEEE Transactions on Control Systems Technology.
  • Costa, O. L. V., et al. (1995). Discrete-time LQ-optimal control problems for infinite Markov jump parameter systems. IEEE Transactions on Automatic Control.
  • Costa, O. L. V., et al. (2005). Discrete-time Markov jump linear systems.
There are more references available in the full text version of this article.


Kan-Jian Zhang received the B.S. degree in mathematics from Nankai University, China, in 1994, and the M.S. and Ph.D. degrees in control theory and control engineering from Southeast University, China, in 1997 and 2000. He is currently an associate professor at the Research Institute of Automation, Southeast University. His research is in nonlinear control theory and its applications, with particular interest in robust output feedback design and optimization control.

Yan-Kai Xu received his B.E. degree in automatic control in 2003 from Tsinghua University, Beijing, China. He is currently a Ph.D. candidate in the Center for Intelligent and Networked Systems (CFINS), Department of Automation, Tsinghua University. His research interests include optimization and control of stochastic systems, discrete event dynamic systems, and machine learning.

Xi Chen received her B.Sc. and M.Eng. from Nankai University, Tianjin, China, in 1986 and 1989, respectively. After graduation, she worked in the Software Engineering Institute at Beijing University of Aeronautics and Astronautics for seven years. From October 1996 she studied at the Chinese University of Hong Kong, where she received her Ph.D. in 2000. She then worked as a post-doctoral fellow in the Information Communication Institute of Singapore and in the Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong. Since July 2003, she has worked in the Center for Intelligent and Networked Systems (CFINS), Department of Automation, Tsinghua University, Beijing, China. Her research interests include wireless sensor networks and stochastic control.

Xi-Ren Cao received the M.S. and Ph.D. degrees from Harvard University in 1981 and 1984, respectively, where he was a research fellow from 1984 to 1986. He then worked as a principal and consultant engineer/engineering manager at Digital Equipment Corporation, USA, until October 1993. Since then, he has been a professor at the Hong Kong University of Science and Technology (HKUST), Hong Kong, China. He is the director of the Center for Networking at HKUST. He has held visiting positions at Harvard University, the University of Massachusetts at Amherst, AT&T Labs, the University of Maryland at College Park, the University of Notre Dame, Tsinghua University, the University of Science and Technology of China, and other universities.

Dr. Cao holds three patents in data and telecommunications and has published two books: Realization Probabilities: The Dynamics of Queuing Systems (Springer-Verlag, 1994) and Perturbation Analysis of Discrete-Event Dynamic Systems (Kluwer Academic Publishers, 1991, co-authored with Y. C. Ho). He received the Outstanding Transactions Paper Award from the IEEE Control Systems Society in 1987 and the Outstanding Publication Award from the Institute of Management Sciences in 1990. He is a fellow of the IEEE, Chairman of the IEEE Fellow Evaluation Committee of the IEEE Control Systems Society, Editor-in-Chief of Discrete Event Dynamic Systems: Theory and Applications, Associate Editor at Large of the IEEE Transactions on Automatic Control, a member of the Board of Governors of the IEEE Control Systems Society, a member of the IFAC Technical Board, and chairman of the IFAC Coordinating Committee on Systems and Signals. He has served as an associate editor of a number of international journals and as chairman of several technical committees of international professional societies. His current research areas include discrete event dynamic systems, stochastic learning and optimization theory, performance analysis of communication systems, and signal processing.

This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Bart De Schutter under the direction of Editor Ian Petersen.

1. Partially supported by the National Natural Science Foundation of China (60404006).

2. Partially supported by the National Natural Science Foundation of China (60574064).

3. Supported in part by a grant from the Hong Kong UGC.
