Neurocomputing

Volume 242, 14 June 2017, Pages 73-82

Neural-network-based synchronous iteration learning method for multi-player zero-sum games

https://doi.org/10.1016/j.neucom.2017.02.051

Abstract

In this paper, a synchronous solution method for multi-player zero-sum (ZS) games without system dynamics is established based on neural networks. A policy iteration (PI) algorithm is presented to solve the Hamilton–Jacobi–Bellman (HJB) equation, and it is proven that the resulting iterative cost function converges to the optimal game value. To avoid requiring the system dynamics, an off-policy learning method based on PI is given to obtain the iterative cost function, controls and disturbances. A critic neural network (CNN), action neural networks (ANNs) and disturbance neural networks (DNNs) are used to approximate the cost function, controls and disturbances, respectively. The weights of these neural networks compose the synchronous weight matrix, which is proven to be uniformly ultimately bounded (UUB). Two examples are given to show the effectiveness of the proposed synchronous solution method for multi-player ZS games.

Introduction

The importance of strategic behavior in the human and social world is increasingly recognized in theory and practice. As a result, game theory has emerged as a fundamental instrument in pure and applied research [1]. Modern day society relies on the operation of complex systems, including aircraft, automobiles, electric power systems, economic entities, business organizations, banking and finance systems, computer networks, manufacturing systems, and industrial processes. Networked dynamical agents have cooperative team-based goals as well as individual selfish goals, and their interplay can be complex and yield unexpected results in terms of emergent teams. Cooperation and conflict of multiple decision-makers for such systems can be studied within the field of cooperative and noncooperative game theory [2]. It is known that many real-world systems are controlled by more than one controller or decision maker, each using an individual strategy. These controllers often operate as a group, with a general quadratic performance index function, as a game. Therefore, many scholars have studied multi-player games. In [3], an off-policy integral reinforcement learning method was developed to solve nonlinear continuous-time multi-player non-zero-sum (NZS) games. In [4], multi-player zero-sum (ZS) differential games for a class of continuous-time uncertain nonlinear systems were solved using upper and lower iterations. ZS game theory relies on solving the Hamilton–Jacobi–Isaacs (HJI) equations, a generalized version of the Hamilton–Jacobi–Bellman (HJB) equations appearing in optimal control problems. In the nonlinear case the HJI equations are difficult or impossible to solve, and may not have global analytic solutions even in simple cases. Therefore, many approximate methods have been proposed to obtain the solution of the HJI equations [5], [6], [7], [8].
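For reference, in the standard two-player ZS setting with dynamics $\dot{x} = f(x) + g(x)u + h(x)d$, value function $V$, and attenuation level $\gamma$, the HJI equation mentioned above takes the textbook form below (stated here for orientation; it is not a formula reproduced from this paper):

```latex
0 = \min_{u}\,\max_{d}\Big[\, x^{\top}Qx + u^{\top}Ru - \gamma^{2} d^{\top}d
      + \big(\nabla V(x)\big)^{\top}\big(f(x) + g(x)u + h(x)d\big) \Big],
```

with the saddle-point policies $u^{*} = -\tfrac{1}{2}R^{-1}g^{\top}(x)\nabla V(x)$ and $d^{*} = \tfrac{1}{2\gamma^{2}}h^{\top}(x)\nabla V(x)$.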

Adaptive dynamic programming (ADP) is an effective approximate method in the optimal control field [9], [10], [11], [12], [13]. ADP algorithms include value iteration (VI) and policy iteration (PI) [14], [15], [16], [17]. VI is a Lyapunov recursion, which is easy to implement and does not require Lyapunov equation solutions [18], [19], [20], [21]. In [22], discrete-time VI was proposed to solve the HJB equation approximately, with convergence analysis. In [23], a novel non-model-based, data-driven adaptive optimal controller was presented via continuous-time VI. In [24], a class of continuous-time nonlinear two-player ZS differential games was considered, and a VI ADP method was proposed for the cases in which the saddle point exists or does not exist. On the other hand, PI refers to a class of algorithms built as a two-step iteration: policy evaluation and policy improvement [25], starting from evaluating the performance index function of a given initial admissible (stabilizing) controller [26], [27], [28]. In [29], a PI algorithm and its convergence analysis were given for nonlinear systems with saturating actuators. In [30], optimal model-free output synchronization of heterogeneous systems using off-policy reinforcement learning was developed. In [31], a data-driven ADP method was proposed for ZS optimal control problems of a class of continuous-time nonlinear systems with unknown dynamics. In [32], an online solution method for two-player ZS games was presented via synchronous PI.
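The two-step PI iteration described above (policy evaluation, then policy improvement) can be sketched in the special case of a continuous-time linear-quadratic regulator, where evaluation reduces to a Lyapunov equation and improvement to a gain update (Kleinman's algorithm). The matrices `A`, `B`, `Q`, `R` below are illustrative assumptions for the sketch, not values from the paper:

```python
import numpy as np

# Illustrative linear system dx/dt = A x + B u with cost integral(x'Qx + u'Ru)dt.
# A, B, Q, R are assumed example values; A is Hurwitz so K = 0 is admissible.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
n = A.shape[0]

def lyap(F, W):
    """Solve F' P + P F + W = 0 for P via the Kronecker (vectorization) trick."""
    M = np.kron(np.eye(n), F.T) + np.kron(F.T, np.eye(n))
    return np.linalg.solve(M, -W.flatten(order="F")).reshape((n, n), order="F")

K = np.zeros((1, n))  # initial admissible (stabilizing) gain
for _ in range(20):
    # Policy evaluation: cost matrix of the current policy u = -K x
    P = lyap(A - B @ K, Q + K.T @ R @ K)
    # Policy improvement: K = R^{-1} B' P
    K = np.linalg.solve(R, B.T @ P)

# At convergence, P satisfies the algebraic Riccati equation
residual = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q
print(np.max(np.abs(residual)))
```

Each pass evaluates the current stabilizing gain exactly and then improves it; the iterates converge monotonically to the Riccati solution, mirroring the cost-function convergence the paper proves in the nonlinear multi-player setting.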

Although the progress on ADP algorithms is significant in the optimal control field, to the best of our knowledge it remains an open problem how to solve multi-player ZS games for completely unknown continuous-time nonlinear systems. In this paper, this open problem is addressed. The main contributions of this paper are summarized as follows.

  • (1)

    A synchronous solution method based on PI algorithm and neural networks is established.

  • (2)

It is proven that, for the traditional PI algorithm with known system dynamics, the iterative cost function converges to the optimal game value.

  • (3)

A synchronous solution method is given to solve the off-policy HJB equation, with convergence analysis, using a critic neural network (CNN), action neural networks (ANNs) and disturbance neural networks (DNNs).

  • (4)

The synchronous weight matrix is proven to be uniformly ultimately bounded (UUB).

The rest of this paper is organized as follows. In Section 2, we present the motivations and preliminaries of the discussed problem. In Section 3, the synchronous solution of multi-player ZS games is developed and the convergence proof is given. In Section 4, two examples are given to demonstrate the effectiveness of the proposed scheme. In Section 5, the conclusion is drawn.

Section snippets

Motivations and preliminaries

In this paper, we consider the continuous-time nonlinear system described by
$$\dot{x} = f(x) + g(x)\sum_{i=1}^{p} u_i + h(x)\sum_{j=1}^{q} d_j,$$
where $x \in \Omega \subset \mathbb{R}^n$ is the system state, and $u_i \in \mathbb{R}^{m_1}$ and $d_j \in \mathbb{R}^{m_2}$ are the control inputs and the disturbance inputs, respectively. $f(x) \in \mathbb{R}^n$, $g(x)$ and $h(x)$ are unknown functions, $f(0) = 0$, and $x = 0$ is an equilibrium point of the system. Assume that $f(x)$, $g(x)$ and $h(x)$ are locally Lipschitz on the compact set $\Omega$ that contains the origin, and that the dynamical system is stabilizable on $\Omega$. The performance
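The snippet above cuts off while introducing the performance index. For multi-player ZS games of this kind, the index is typically of the quadratic form below; this is a standard form stated as an assumption, since the paper's exact definition is not visible in the snippet:

```latex
J(x_0, u_1,\dots,u_p, d_1,\dots,d_q)
  = \int_{0}^{\infty} \Big( x^{\top}Qx
      + \sum_{i=1}^{p} u_i^{\top} R_i u_i
      - \sum_{j=1}^{q} \gamma_j^{2}\, d_j^{\top} d_j \Big)\,\mathrm{d}t,
```

which the controls $u_i$ seek to minimize and the disturbances $d_j$ seek to maximize.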

Synchronous solution of multi-player ZS games

In this section, off-policy algorithm will be proposed based on Algorithm 1. The neural networks implementation process is also given. Based on that, the stability of the synchronous solution method is proven.
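Although the full implementation is not shown in this snippet, the core idea of the critic approximation is to represent the cost as a linear combination of basis functions, $V(x) \approx W^{\top}\varphi(x)$, and to fit the weights $W$ from sampled data. The following minimal sketch fits such a critic to a known quadratic value function by least squares; the basis, target function, and sampling scheme are illustrative assumptions, not the paper's actual update law:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed target value function V(x) = x' P x for a 2-D state.
P = np.array([[2.0, 0.5], [0.5, 1.0]])

def phi(x):
    """Quadratic polynomial basis, a common choice for a critic network."""
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

# Sample states and target values, then fit the critic weights W in
# V(x) ~ W' phi(x) by least squares.
X = rng.uniform(-1.0, 1.0, size=(200, 2))
targets = np.array([x @ P @ x for x in X])
Phi = np.array([phi(x) for x in X])
W, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

print(W)  # recovers [P00, 2*P01, P11] = [2.0, 1.0, 1.0]
```

In the paper's synchronous scheme, the CNN, ANN and DNN weights are instead updated simultaneously along system trajectories, and the UUB analysis bounds the resulting weight errors.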

Simulation study

In this section, two examples will be provided to demonstrate the effectiveness of the optimal control scheme proposed in this paper.

Conclusions

This paper proposed a synchronous solution method for multi-player zero-sum games without system dynamics, based on neural networks. A PI algorithm is presented to solve the HJB equation with known system dynamics. It is proven that the iterative cost function obtained by PI converges to the optimal game value. Based on PI, an off-policy learning method is given to obtain the iterative cost function, controls and disturbances. The weights of the CNN, ANNs and DNNs compose the synchronous weight matrix, which is

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grants 61304079, 61673054, and 61374105, in part by the Fundamental Research Funds for the Central Universities under Grant FRF-TP-15-056A3, and in part by the Open Research Project from SKLMCCS under Grant 20150104.

References (42)

  • D. Vrabie et al.

    Adaptive optimal control for continuous-time linear systems based on policy iteration

    Automatica

    (2009)
  • R. Song et al.

    Adaptive dynamic programming for a class of complex-valued nonlinear systems

    IEEE Trans. Neural Netw. Learn. Syst.

    (2014)
  • M. Abu-Khalaf et al.

    Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach

    Automatica

    (2005)
  • K. Vamvoudakis et al.

    Online solution of nonlinear two player zero-sum games using synchronous policy iteration

    Int. J. Robust Nonlinear Control

    (2012)
  • F. Lewis et al.

    Optimal Control

    (2012)
  • D.W.K. Yeung et al.

    Cooperative Stochastic Differential Games

    (2006)
  • F. Lewis et al.

    Optimal Control, Third Edition

    (2012)
  • R. Song et al.

    Off-policy integral reinforcement learning method to solve nonlinear continuous-time multi-player non-zero-sum games

    IEEE Trans. Neural Netw. Learn. Syst.

    (2017)
  • D. Liu et al.

    Multiperson zero-sum differential games for a class of uncertain nonlinear systems

    Int. J. Adapt. Control Signal Process.

    (2014)
  • C. Mu et al.

    Iterative GDHP-based approximate optimal tracking control for a class of discrete-time nonlinear systems

    Neurocomputing

    (2016)
  • X. Fang et al.

    Data-driven heuristic dynamic programming with virtual reality

    Neurocomputing

    (2015)
    Ruizhuo Song received the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2012. She was a postdoctoral fellow with University of Science and Technology Beijing, Beijing, China. She is currently an Associate Professor with the School of Automation and Electrical Engineering, University of Science and Technology Beijing. She was a Visiting Scholar with the Department of Electrical Engineering at University of Texas at Arlington, Arlington, TX, USA, from 2013 to 2014. Her current research interests include optimal control, neural-networks-based control, nonlinear control, wireless sensor networks, and adaptive dynamic programming and their industrial application. She has published over 40 journal and conference papers, and coauthored 2 monographs.

    Qinglai Wei received the B.S. degree in Automation, and the Ph.D. degree in control theory and control engineering, from the Northeastern University, Shenyang, China, in 2002 and 2009, respectively. From 2009–2011, he was a postdoctoral fellow with The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. He is currently a Professor of the institute. He has authored three books, and published over 60 international journal papers. His research interests include adaptive dynamic programming, neural-networks-based control, optimal control, nonlinear systems and their industrial applications. Dr. Wei is an Associate Editor of IEEE Transaction on Systems Man, and Cybernetics: Systems since 2016, Information Sciences since 2016, Neurocomputing since 2016, Optimal Control Applications and Methods since 2016, Acta Automatica Sinica since 2015, and has been holding the same position for IEEE Transactions on Neural Networks and Learning Systems during 2014–2015. He is the Secretary of IEEE Computational Intelligence Society (CIS) Beijing Chapter since 2015. He was Registration Chair of the 12th World Congress on Intelligent Control and Automation (WCICA2016), 2014 IEEE World Congress on Computational Intelligence (WCCI2014), the 2013 International Conference on Brain Inspired Cognitive Systems (BICS 2013), and the Eighth International Symposium on Neural Networks (ISNN 2011). He was the Publication Chair of 5th International Conference on Information Science and Technology (ICIST2015) and the Ninth International Symposium on Neural Networks (ISNN 2012). He was the Finance Chair of the 4th International Conference on Intelligent Control and Information Processing (ICICIP 2013) and the Publicity Chair of the 2012 International Conference on Brain Inspired Cognitive Systems (BICS 2012). He was guest editors for several international journals. 
He was a recipient of Shuang-Chuang Talents in Jiangsu Province, China, in 2014. He was a recipient of the Outstanding Paper Award of Acta Automatica Sinica in 2011 and Zhang Siying Outstanding Paper Award of Chinese Control and Decision Conference (CCDC) in 2015. He was a recipient of Young Researcher Award of Asia Pacific Neural Network Society (APNNS) in 2016.

    Biao Song received the B.S. degree in electronic information engineering from Yanbian University, Yanbian, China, in 2011. He is currently a Ph.D. student with the School of Automation and Electrical Engineering, University of Science and Technology Beijing. His research interests include wireless sensor networks, adaptive dynamic programming.
