Data-driven heuristic dynamic programming with virtual reality
Introduction
Recent research on intelligent systems focuses on developing self-adaptive systems that imitate the signal-processing mechanisms of the biological brain, in order to close the gap between human behavior and computer decisions [1], [2], [3], [4]. Generally, an intelligent system should be designed with the following abilities:
- to acquire and retain knowledge through the perception of environments;
- to analyze and predict actions from prior experiences;
- to produce and adjust the actions to obtain consistent and reliable results over time [3], [4], [5], [6].
There are two important issues to be considered regarding intelligent systems. The first is how to develop biologically inspired general-purpose learning models that can dynamically learn how to achieve a goal [4], [7], [8], [9], and the second is how to represent data information to support the decision-making process in an efficient and effective way [10], [11], [12], [13].
In the literature, adaptive dynamic programming (ADP) [14], [15], [16], [17] provides a powerful learning mechanism [5], which has been successfully applied in many fields, such as dynamic control systems [8], [16], power systems [18], [19], missile systems [20], and communication systems [21], as well as partially observable Markov decision processes [22]. ADP methods can be categorized as heuristic dynamic programming (HDP) [16], dual heuristic dynamic programming (DHP), and globalized dual heuristic dynamic programming (GDHP) [23]. The foundation of ADP can be traced back to Bellman's classic principle of optimality [24]. It is related to reinforcement learning (RL) through the adaptive-critic (AC) design framework [25], [26], [27]. For example, in the model-free direct HDP structure, two neural networks are used to achieve on-line learning. The action network interacts with the external environment to produce actions according to the observed state vector. The critic network evaluates the performance by means of an external reinforcement signal. The weights of both networks are randomly initialized. The optimal solution is obtained by training the weights with backpropagation to approximate the optimal cost-to-go function [8], [28]. Such a model-free technique has also been successfully demonstrated in DHP designs [29].
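For reference, the direct HDP critic approximates the standard discounted cost-to-go and is typically trained on a prediction-error term of temporal-difference type. The following is a minimal formulation in common ADP notation (α is the discount factor and r the external reinforcement); it is a hedged summary of the standard design, not an equation quoted from this paper:

```latex
% Discounted cost-to-go approximated by the HDP critic (standard form)
J(t) = r(t+1) + \alpha\, r(t+2) + \alpha^{2} r(t+3) + \cdots
     = r(t+1) + \alpha\, J(t+1)

% Critic training drives this prediction error toward zero
e_{c}(t) = \alpha\, J(t) - \bigl[\, J(t-1) - r(t) \,\bigr]
```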
Based on the direct HDP design [16], an enhanced approach, called goal representation heuristic dynamic programming (GrHDP), was proposed in [4], [7]. In the GrHDP model, an additional network, called the goal network, is integrated into the framework to interact with the critic network [7], [30]. The motivation of the goal network is to provide an internal reinforcement representation that supports on-line learning and optimization [8], [31]. The goal network learns from the external reinforcement signal and adaptively generates an internal reinforcement signal to guide system behavior [8], [30], [31]. The internal reinforcement signal is also fed into the critic network. Since it is a continuous value, the internal reinforcement signal can be considered more informative, and this additional input contributes to the fine-tuning of the critic network. Moreover, the internal reinforcement signal can be automatically and efficiently adjusted to improve the learning performance of the goal network. Meanwhile, the GrHDP approach also stores the previous cost-to-go value to enable on-line learning, association, and optimization over time [7]. More recently, the corresponding goal representation dual heuristic programming (GrDHP) design has also been investigated and demonstrated on several control examples with promising results [32], [33].
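One common formulation of this idea in the GrHDP literature (again in generic ADP notation, with discount factors α and γ; a hedged summary rather than a verbatim quotation from this paper) trains the goal network against the external signal and the critic against the internal signal:

```latex
% Goal network: learn a discounted representation s(t) of the external signal r(t)
e_{g}(t) = \alpha\, s(t) - \bigl[\, s(t-1) - r(t) \,\bigr]

% Critic network: use the internal signal s(t) in place of r(t)
e_{c}(t) = \gamma\, J(t) - \bigl[\, J(t-1) - s(t) \,\bigr]
```

Driving both errors toward zero makes s(t) behave like a shaped, continuous-valued reinforcement that the critic can exploit for finer credit assignment.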
During simulation design, virtual reality (VR) provides a straightforward way to construct such a solution [34], [35]. In general, VR is a powerful technique that may be implemented through a set of complex hardware devices, such as a mainframe computer, a head-mounted display (HMD), motion trackers, sensor gloves, or a computer automatic virtual environment (CAVE) [35], [36]. In fact, the key to VR lies not only in the hardware devices, but also in the human–computer interaction (HCI) involving the people participating in and exploring a virtual environment (VE). Recently, many researchers have focused on the design of visualization/interaction applications rather than on improving VR rendering algorithms or hardware devices [37]. In psychology, VEs are used to treat patients suffering from anxiety or phobias [38], [39]. In education, a virtual learning environment was developed to help students learn their courses [40], [41]. In urban traffic research, VEs have been used to analyze interactive driving behaviors under different traffic scenarios [42], [43]. In military research, VE simulations have been developed to improve soldiers' decision-making skills [44] and cross-cultural communication in different situations [45]. In surgery, a VR simulator was proposed as a cost-effective training platform [46], [47]. In physical rehabilitation, VEs have been used to help patients train their movement patterns and enhance their physical rehabilitation [48], [49].
With the increasing dimensionality and volume of data, there also seems to be a growing need for visualization and VR platforms in machine intelligence research. To this end, first, a VE must be developed using computer graphics (CG) software, so that researchers can design a virtual experiment adequate to their research tasks. Second, real data information can be used in the VE, making it a suitable representation of the real world. Third, during the dynamic simulation process, researchers can interact with the experiment through the VE, analyzing the system behavior and possibly proposing changes to the experiment. For example, an additional virtual disturbance might be designed to verify the stability of a dynamic system.
Motivated by our previous works on ADP designs [7], [8], [30] and their applications using VR [50], [51], [52], the goal of this paper is to propose the integration of VR into the development of machine learning solutions. Specifically, we build an interactive VR platform and apply it to GrHDP experiments in order to illustrate how a VR platform can enhance the development of a machine learning application. We first apply VR to GrHDP in the triple-link inverted pendulum balancing problem [7] and the maze navigation problem [30], [31], and then develop a new application benchmark, a robot navigation with obstacle avoidance problem. We hope to demonstrate how the combination of machine learning research and VR improves the investigation of many real-world applications.
The rest of this paper is organized as follows. In Section 2, we discuss the structure of the GrHDP approach. The design of the VR interactive platform is described in detail in Section 3. Based on this platform, in Sections 4, 5, and 6, we study, respectively, the design and simulation of experiments for the triple-link inverted pendulum balancing problem, the maze navigation problem, and the robot navigation with obstacle avoidance problem. Finally, the conclusion and discussion are provided in Section 7.
Section snippets
Online learning with the GrHDP approach
The main structure of GrHDP is given in Fig. 1. Compared to other ADP approaches [16], GrHDP integrates an additional network (the goal network) into the direct HDP structure, interacting with both the critic network and the action network. All the neural networks are multi-layer perceptrons (MLPs) with one hidden layer [50]. The inputs of the goal network and the critic network can be defined as [x(t), u(t)] and [x(t), u(t), s(t)], respectively, where x(t) is the state vector, u(t) is the control action, and s(t) is the internal reinforcement signal. Meanwhile, the input of the action network is still the current state vector x(t).
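As a concrete illustration of this structure, the following Python/NumPy sketch shows three one-hidden-layer MLPs and their chained forward pass; the layer sizes, initialization ranges, and state dimension are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class MLP:
    """One-hidden-layer perceptron with tanh activations (illustrative sizes)."""
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W1 = rng.uniform(-0.1, 0.1, (n_hidden, n_in))
        self.W2 = rng.uniform(-0.1, 0.1, (n_out, n_hidden))

    def forward(self, x):
        self.h = np.tanh(self.W1 @ x)       # hidden layer
        return np.tanh(self.W2 @ self.h)    # bounded output

rng = np.random.default_rng(0)
n_states = 4                                # hypothetical state dimension

action_net = MLP(n_states,     6, 1, rng)   # input: x(t)              -> u(t)
goal_net   = MLP(n_states + 1, 6, 1, rng)   # input: [x(t), u(t)]      -> s(t)
critic_net = MLP(n_states + 2, 6, 1, rng)   # input: [x(t), u(t), s(t)] -> J(t)

def grhdp_forward(x):
    """One forward pass through the action-goal-critic chain of GrHDP."""
    u = action_net.forward(x)                           # control action
    s = goal_net.forward(np.concatenate([x, u]))        # internal reinforcement
    J = critic_net.forward(np.concatenate([x, u, s]))   # cost-to-go estimate
    return u, s, J
```

The weight updates themselves would be driven via backpropagation by the error terms sketched earlier.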
The interactive VR platform design
Basically, a desktop computer is one of the most common VR application interfaces. On a desktop computer, the monitor serves as the visualization platform for representing the virtual world; meanwhile, external equipment, such as the keyboard or the mouse, provides the interactive tools for human interaction. These hardware devices render a real-time visualization/interaction interface supported by CG software [34], [35]. In CG software, the virtual reality modeling language (VRML) is usually used to model the virtual scene.
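To make the interaction pattern concrete, the following Python sketch shows a generic controller/virtual-environment loop of the kind described above; the VirtualScene class, its methods, and the timing values are hypothetical placeholders, not the authors' VRML-based platform or any specific library API.

```python
import time

class VirtualScene:
    """Hypothetical stand-in for the CG/VRML front end that renders the experiment."""
    def render(self, state):
        """Redraw the virtual world for the current physical state."""
        print(f"state = {state}")

    def poll_user_disturbance(self):
        """Return a user-injected disturbance (e.g., from keyboard input), or 0."""
        return 0.0

def run_episode(plant_step, controller, x0, scene, steps=100, dt=0.02):
    """Advance the simulated plant while visualizing it and accepting user input."""
    x = x0
    for _ in range(steps):
        u = controller(x)                   # learning controller (e.g., GrHDP action net)
        d = scene.poll_user_disturbance()   # interactive disturbance from the VE
        x = plant_step(x, u + d)            # advance the simulated dynamics
        scene.render(x)                     # real-time visualization
        time.sleep(dt)
    return x
```

In the actual platform, a keyboard or mouse handler would replace poll_user_disturbance, which is how an additional virtual disturbance can be injected during a run to probe the stability of the controlled system.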
GrHDP approach to triple-link inverted pendulum balancing problem with VR
The stabilization of nonlinear systems has been widely studied for control/learning research purposes [55]. The inverted pendulum balancing model is a popular demonstration benchmark for nonlinear systems. The task of this benchmark is to stabilize an inverted pendulum [16], [55]. We previously applied GrHDP to the traditional triple-link inverted pendulum balancing problem in [7]. In this work, we introduce the advantage of VR to improve the development of experiments. The structure of the
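As a small concrete piece of this benchmark setup, the sketch below shows the kind of binary external reinforcement signal commonly used in inverted-pendulum balancing tasks; the specific angle and track limits are assumptions drawn from common practice, not values quoted from this paper.

```python
import numpy as np

# Assumed failure thresholds (illustrative, not the paper's exact settings)
ANGLE_LIMIT_RAD = np.deg2rad(20.0)   # max deviation of each link from vertical
TRACK_LIMIT_M   = 1.0                # max cart displacement along the track

def external_reinforcement(link_angles, cart_position):
    """Return 0 while the system stays within bounds, -1 on failure."""
    failed = (np.any(np.abs(link_angles) > ANGLE_LIMIT_RAD)
              or abs(cart_position) > TRACK_LIMIT_M)
    return -1.0 if failed else 0.0
```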
Maze navigation with virtual reality
Maze navigation is a path-planning problem in which the agent needs to find a proper path from an initial state to a given goal [57], [58]. It is a typical Markov decision process (MDP) problem, where the control action is not related to previous state vectors but only to the current state [30], [31]. The task of this problem is to find an optimal path. Once the optimal path is obtained, future motion should follow this path. Evaluating value function (or state-action pair) is a
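For illustration, a minimal grid-maze MDP in Python might look as follows; the maze layout, goal cell, and reward values are illustrative assumptions rather than the exact settings used in the paper.

```python
import numpy as np

# Illustrative maze: 1 = wall, 0 = free cell (assumed layout, not the paper's maze)
MAZE = np.array([[0, 0, 1, 0],
                 [1, 0, 1, 0],
                 [0, 0, 0, 0]])
GOAL = (2, 3)
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def step(state, action):
    """Apply an action; stay in place when hitting a wall or the boundary."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < MAZE.shape[0] and 0 <= nc < MAZE.shape[1]) or MAZE[nr, nc]:
        nr, nc = r, c                                 # blocked move
    reward = 1.0 if (nr, nc) == GOAL else -0.04       # assumed reward scheme
    return (nr, nc), reward, (nr, nc) == GOAL
```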
Robot navigation with obstacle avoidance using virtual reality
Robot navigation with obstacle avoidance is a popular research experiment in robot motion planning [57], [61]. The task can be described as designing a robot that observes the environment and finds a collision-free path [62], [63], [64], [65]. The solution to this problem can be split into two parts: (a) detecting the robot's environment and obstacles; and (b) designing the navigation control strategy [62]. The VR platform is designed as a physical system for system-state detection, a controller module
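Following this two-part split, here is a hedged Python sketch of (a) approximate range sensing of circular obstacles and (b) a naive heading choice standing in for the learned navigation strategy; the sensor model, ray count, and thresholds are assumptions for illustration only.

```python
import numpy as np

def sense_obstacles(robot_pos, heading, obstacles, n_rays=5, max_range=2.0):
    """Return approximate range readings along n_rays directions around heading.

    obstacles: list of (x, y, radius) tuples; robot_pos: length-2 array.
    """
    angles = heading + np.linspace(-np.pi / 4, np.pi / 4, n_rays)
    readings = np.full(n_rays, max_range)
    for i, a in enumerate(angles):
        ray = np.array([np.cos(a), np.sin(a)])
        for obs in obstacles:
            to_obs = np.array(obs[:2]) - robot_pos
            along = to_obs @ ray                      # projection onto the ray
            if 0 < along < max_range:
                lateral = np.linalg.norm(to_obs - along * ray)
                if lateral < obs[2]:                  # ray passes through the circle
                    readings[i] = min(readings[i], along)
    return readings

def choose_heading(readings, heading):
    """Steer toward the clearest direction (a naive stand-in for the learned policy)."""
    angles = heading + np.linspace(-np.pi / 4, np.pi / 4, len(readings))
    return angles[int(np.argmax(readings))]
```

In the actual benchmark a learned GrHDP action network would replace choose_heading; the sensing part only defines the state observation fed to the learner.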
Conclusions and future research
The contribution of this work is the proposition of using a VR platform to improve machine learning research. Such a platform provides a straightforward experiment/simulation interface for real-time representation of dynamic processes and for human-computer interaction. Specifically, the virtual experiment environment can be built with real-time data information stored in it. The VR platform visualization can handle the transmission of data information and reproduce the decision-making process.
Acknowledgements
The authors would like to thank the associate editor and the anonymous reviewers for their useful comments, which helped to improve this paper. This work was supported in part by the National Science Foundation (NSF) under Grant ECCS 1053717, and the Army Research Office under Grant W911NF-12-1-0378.
References (65)
- Intelligence in the brain: a theory of how it works and how to build it, IEEE Trans. Neural Netw. (2009)
- A three-network architecture for on-line learning and optimization based on adaptive dynamic programming, Neurocomputing (2012)
- DCPE co-training for classification, Neurocomputing (2012)
- Reactive power control of grid-connected wind farm based on adaptive dynamic programming, Neurocomputing (2014)
- P.J. Werbos, Backwards differentiation in AD and neural nets: past links and new opportunities, in: Automatic...
- P.J. Werbos, What do neural nets and quantum theory tell us about mind and reality, in: No Matter, Never Mind:...
- Intelligent system, Int. J. Comput. Commun. Control (2008)
- Self-Adaptive System for Machine Intelligence (2011)
- ADP: the key direction for future research in intelligent control and understanding brain intelligence, IEEE Trans. Syst. Man Cybern. Part B: Cybern. (2008)
- Adaptive learning in tracking control based on the dual critic network design, IEEE Trans. Neural Netw. Learn. Syst. (2013)
- Learning from imbalanced data, IEEE Trans. Knowl. Data Eng.
- Sparse-representation-based classification with structure-preserving dimension reduction, Cognit. Comput.
- On-line learning control by association and reinforcement, IEEE Trans. Neural Netw.
- Adaptive critic designs, IEEE Trans. Neural Netw.
- Energy storage based low frequency oscillation damping control using particle swarm optimization and heuristic dynamic programming, IEEE Trans. Power Syst.
- Missile defense and interceptor allocation by neuro-dynamic programming, IEEE Trans. Syst. Man Cybern. Part A: Syst. Hum.
- Dynamic re-optimization of a fed-batch fermentor using adaptive critic designs, IEEE Trans. Neural Netw.
- Handbook of Intelligent Control
- Dynamic Programming
- Reinforcement Learning: An Introduction
- Reinforcement learning: a survey, J. Artif. Intell. Res.
- Reinforcement learning and adaptive dynamic programming for feedback control, IEEE Circuits Syst. Mag.
- Adaptive dynamic programming: an introduction, IEEE Comput. Intell. Mag.
- Goal representation heuristic dynamic programming on maze navigation, IEEE Trans. Neural Netw. Learn. Syst.
- Heuristic dynamic programming with internal goal representation, Soft Comput.
- GrDHP: a general utility function representation for dual heuristic dynamic programming, IEEE Trans. Neural Netw. Learn. Syst.
Xiao Fang received his B.S. degree in Department of Mechanical and Electrical Engineering from Hebei University of Engineering, China, in 2007. He is a current Ph.D. student in School of Electrical Engineering, Yanshan University, China. He is also a joint Ph.D. student in Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, RI, USA. His major research interests include adaptive dynamic programming, reinforcement learning, computational intelligence, and virtual reality.
Dezhong Zheng received the B.S. degree and M.S. degree in School of Electrical Engineering, Yanshan University, China, in 1982 and 1988, respectively. He is currently a professor at the Key Lab of Measuring Technology and Instrument of Hebei Province, School of Electrical Engineering, Yanshan University, China. His current research interests include dynamic control system, virtual reality, pattern recognition, virtual instrument, remote control technology, network information technology, and others. He has been an Associate Editor for Chinese Journal of Sensors and Actuators. He is the vice-president of the Computer Aided Design (CAD) Society and the Artificial Intelligence Society in Hebei, China.
Haibo He received the B.S. and M.S. degrees in electrical engineering from Huazhong University of Science and Technology, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Ohio University in 2006. He is currently an Associate Professor at the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island. From 2006 to 2009, he was an Assistant Professor at the Department of Electrical and Computer Engineering, Stevens Institute of Technology.
His current research interests include adaptive dynamic programming, computational intelligence, self-adaptive systems, machine learning, data mining, embedded intelligent system design (VLSI/FPGA), and various applications such as smart grid, cognitive radio, sensor networks, and others. He has published 1 sole-author research book (Wiley), edited 1 book (Wiley-IEEE) and 6 conference proceedings (Springer), and authored and co-authored over 120 peer-reviewed journal and conference papers. His research results have been covered by national and international media such as The Wall Street Journal, Yahoo!, and Providence Business News, among others. He has delivered numerous keynote and invited talks. He is currently the Chair of the IEEE Computational Intelligence Society (CIS) Neural Network Technical Committee (NNTC), among others. He regularly serves on the Organizing Committee of various international conferences, including as the General Chair of the IEEE Symposium Series on Computational Intelligence (SSCI 2014). He has been a Guest Editor for several journals, including IEEE Computational Intelligence Magazine, IEEE Transactions on Smart Grid, Cognitive Computation (Springer), Applied Mathematics and Computation (Elsevier), and Soft Computing (Springer), among others. Currently, he is an Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems and IEEE Transactions on Smart Grid, and also serves on the Editorial Board of several international journals. He received the National Science Foundation (NSF) CAREER Award (2011) and the Providence Business News (PBN) "Rising Star Innovator of The Year" Award (2011). He is a Senior Member of IEEE.
Zhen Ni received the B.S. degree from the Department of Control Science and Engineering, Huazhong University of Science and Technology (HUST), Wuhan, China, in 2010, and the M.S. degree from the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, Rhode Island, in 2012. Currently, he is pursuing the Ph.D. degree in the same department.
His research interests include computational intelligence and reinforcement learning, specifically in the adaptive dynamic programming (ADP) and optimal/adaptive control.