
Information Sciences

Volume 286, 1 December 2014, Pages 209-227

Reinforcement learning with automatic basis construction based on isometric feature mapping

https://doi.org/10.1016/j.ins.2014.07.008

Abstract

Value function approximation (VFA) has been a major research topic in reinforcement learning. Although various reinforcement learning algorithms with VFA have been proposed, the performance of most previous algorithms depends on the predefined structure of the basis functions. To address this problem, this paper presents a novel basis learning method for VFA based on isometric feature mapping (IFM). In the proposed method, basis functions for VFA are automatically generated by constructing the optimal embedding basis of the data in a d-dimensional Euclidean space, which best preserves the estimated intrinsic geometry of the manifold. Furthermore, the IFM-based basis learning method is integrated with approximate policy iteration (API) for learning control in Markov decision problems with large state spaces, and a new manifold reinforcement learning framework termed IFM-based API (IFM-API) is presented. Three learning control problems, including a real control system of the Googol single inverted pendulum, were studied to evaluate the performance of the proposed IFM-API algorithm. The simulation and experimental results show that, compared with other basis selection or learning methods, the IFM-based basis learning method can automatically compute an efficient set of basis functions with far fewer predefined parameters and lower computational cost. Moreover, the results show that the proposed IFM-API algorithm obtains better learning control policies than other API methods.

Introduction

In recent years, many encouraging achievements and applications have been reported in the supervised and unsupervised learning domains [1], [5], [8], [14], [15], [23]. However, these methods still have difficulty in dealing efficiently with sequential decision-making or learning control problems [7], [24]. Reinforcement learning (RL) [30], as a machine learning framework, has been widely studied [7], [18], [30], [34], [35], [37] in the past decade for solving sequential decision-making problems, which are usually modeled as Markov Decision Processes (MDPs). In RL, the action policy of a learning agent is modified to maximize its total rewards during its interaction with an environment. Owing to this property, RL is more suitable for solving learning control problems (which can be modeled as MDPs) than supervised learning methods and mathematical programming methods [7], [9], [24]. In the early stages of RL research, most efforts were focused on learning control algorithms for MDPs with discrete state and action spaces. However, MDPs in many real-world applications usually have continuous or large-scale state spaces. If the value functions in such spaces are still represented in a discrete tabular form, the learning and convergence rates will be affected negatively [24].

To address the above problem, hierarchical RL (HRL) and approximate RL methods have been widely studied. As illustrated in [4], existing work in HRL can be grouped into three aspects: the abstraction of action sets [31], the abstraction of states [11] and the decomposition of the state space [10]. Some successful applications of HRL have been reported, e.g., robot path tracking [38] and web services composition [12]. Approximate RL, which is also called approximate dynamic programming (ADP) [34], has received increasing attention in recent years. Previous research on approximate RL mainly includes three perspectives: policy search [3], value function approximation (VFA) [29], and actor-critic methods [25]. Actor-critic methods can be regarded as a combination of policy search and VFA. It has been shown that actor-critic algorithms can achieve better learning efficiency than policy search or VFA alone in online learning control of MDPs with large state spaces [25]. In actor-critic algorithms, an actor is used for policy learning or policy improvement and a critic is employed for policy evaluation or value function approximation. In recent years, more and more research efforts have been focused on actor-critic approaches, and adaptive critic designs (ACDs) [18] have become a significant class of learning control methods for linear or nonlinear dynamic systems.

Despite the above advances, it is well known that VFA is central to all successful applications of RL, and a variety of nonlinear and linear approximation architectures have been studied. In particular, VFA approaches with a linear approximation architecture have been widely used due to their favorable convergence and stability properties. Recent advances in VFA with linear function approximators include linear SARSA learning [29] and least-squares policy iteration (LSPI) [17]. Although nonlinear VFA exhibits better approximation ability than linear VFA, few rigorous theoretical results have been reported for RL applications using nonlinear VFA. In recent years, many research efforts have been devoted to RL algorithms using the linear VFA architecture. For example, LSPI has been widely studied as an efficient RL algorithm with linear basis functions [9], [17]. However, the basis functions in LSPI need to be selected manually. In [36], a kernel-based least-squares policy iteration (KLSPI) algorithm was proposed by replacing the hand-coded basis functions with kernel-based features. Although the kernel-based features can be generated in a data-driven style, the kernel functions still need to be chosen carefully by the designer. Therefore, one common drawback of previous work in VFA is that the basis functions or kernel functions are usually hand-coded by human experts, rather than automatically constructed from the geometry of the underlying state space.
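For concreteness, the policy-evaluation core of LSPI with a linear architecture can be sketched as follows. This is a minimal illustration only, assuming user-supplied `phi(s, a)` and `policy` callables; it is not the implementation used in the paper.

```python
import numpy as np

def lstdq(samples, phi, policy, gamma=0.95):
    """One LSTD-Q policy-evaluation step, as used inside LSPI.

    samples : list of (s, a, r, s_next) transitions
    phi     : feature map, phi(s, a) -> 1-D numpy array of basis values
    policy  : current greedy policy, policy(s_next) -> action
    Returns the weight vector w such that Q(s, a) ~ phi(s, a) . w
    """
    k = phi(*samples[0][:2]).shape[0]
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    # Pseudo-inverse guards against a rank-deficient A on small sample sets.
    return np.linalg.pinv(A) @ b
```

In LSPI, this evaluation step alternates with greedy policy improvement until the weight vector converges; the quality of the result hinges on the choice of the basis functions encoded in `phi`, which motivates the automatic basis construction studied in this paper.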

In pattern recognition and machine learning, one of the key problems is to develop appropriate and effective low-dimensional representations for complex high-dimensional data, namely the dimensionality reduction problem. A prevalent approach to this problem is based on the notion of a manifold. It has been shown that a set of high-dimensional data can usually be described as a set of vectors on a low-dimensional nonlinear manifold [19], [27]. Given a set of data points $x = [x_1, x_2, \ldots, x_n]$ in a low-dimensional space $\mathbb{R}^{d_L}$, let $f: x_i \mapsto \mathbb{R}^{d_H}$ be a smooth embedding, for some $d_H > d_L$. Manifold learning aims to recover $x$ and $f$ from a given set of observed data $\{y_i = f(x_i)\}$ in $\mathbb{R}^{d_H}$. Until now, various manifold learning algorithms have been developed, e.g., locally linear embedding (LLE) [26], isometric mapping (ISOMAP) [32], and Laplacian eigenmaps (LE) [6]. A general graph embedding framework was presented in [19] to unify various dimensionality reduction methods.
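As a quick illustration of this setup, the sketch below recovers a low-dimensional embedding from high-dimensional observations using scikit-learn's off-the-shelf Isomap estimator on synthetic placeholder data; it is not the data or code used in the paper.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# Synthetic data lying on a 2-D manifold (the "swiss roll") embedded in R^3.
X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

# Recover a 2-D embedding that approximately preserves geodesic distances
# along the manifold; the only structural parameter is the neighborhood size.
embedding = Isomap(n_neighbors=12, n_components=2).fit_transform(X)
print(embedding.shape)  # (1000, 2)
```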

Although manifold learning has been widely studied for regression and classification tasks, there is little work on integrating manifold learning into reinforcement learning. Recently, based on the principle of Laplacian eigenmaps, an algorithm called representation policy iteration (RPI) was proposed for value function approximation in RL [21]. Basis functions in RPI can be constructed automatically through the spectral analysis of a self-adjoint Laplacian operator, instead of using a hand-coded parametric architecture. However, in RPI, apart from the numbers of nearest neighbors and basis functions, some other parameters also need to be predefined, e.g., the Laplacian type and the width of the Gaussian distance function. Moreover, it is still difficult to learn or construct a "representative" graph by trajectory sampling in continuous or large-scale state spaces.
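For comparison with the IFM-based approach developed below, graph-Laplacian basis construction in the spirit of RPI can be sketched roughly as follows. This is a simplified illustration under assumed defaults (an unweighted, symmetrized k-NN graph and the normalized Laplacian), not the RPI implementation itself.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

def laplacian_basis(states, k_neighbors=10, num_basis=20):
    """Graph-Laplacian basis functions over a set of sampled states."""
    W = kneighbors_graph(states, k_neighbors, mode='connectivity')
    W = 0.5 * (W + W.T)                      # symmetrize the k-NN graph
    L = laplacian(W, normed=True).toarray()  # dense is fine for modest sample sets
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    return eigvecs[:, :num_basis]            # smoothest eigenvectors as basis functions
```

Each design choice in this sketch (graph weighting, Laplacian type, kernel width for weighted graphs) corresponds to one of the parameters that must be predefined in RPI; the IFM-based method presented in Section 3 instead derives the basis from a geodesic-distance-preserving embedding, which reduces the number of such choices.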

Compared with other manifold learning methods, ISOMAP [2], [32] requires only one parameter to be determined and seeks to preserve the intrinsic geometry of the data. The key problem in ISOMAP is how to estimate the geodesic distance between faraway points using only input-space distances. As illustrated in [2], [32], the geodesic distance between neighboring points can be well approximated by the input-space distance, while the geodesic distance between faraway points can be approximated by computing the total length of a sequence of short paths between neighboring points. After constructing a graph that connects neighboring data points by edges, one can efficiently approximate the geodesic distance by searching for shortest paths in the graph. Inspired by this idea, this paper presents a novel basis learning method based on isometric feature mapping (IFM). Using this method, basis functions are constructed automatically in a way that preserves the intrinsic geometry of the collected samples. Furthermore, the IFM-based basis learning method is integrated with approximate policy iteration (API) for learning control in MDPs with continuous or large-scale state spaces, and a new manifold reinforcement learning framework called IFM-based API (IFM-API) is proposed. Three learning control problems, including a real control system of the Googol single inverted pendulum, were studied to evaluate the performance of the proposed IFM-API algorithm. The simulation and experimental results show that, compared with some other basis selection or learning methods, the IFM-based basis learning method can automatically compute an efficient set of basis functions with fewer predefined parameters and lower computational cost. In addition, the clustering-based sub-sampling method [13], which is a better choice than the trajectory-based sub-sampling method [20], [21], is used in IFM-API. It is shown that the proposed IFM-API algorithm can obtain better learning control policies than some other API methods, e.g., KLSPI [36], LSPI [17] and RPI [21].
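The geodesic-distance estimation just described can be sketched in a few lines using off-the-shelf routines (a k-nearest-neighbor distance graph plus Dijkstra shortest paths); this is a generic illustration, not the exact procedure or parameter settings used in the paper.

```python
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(points, k_neighbors=7):
    """Approximate all pairwise geodesic distances, ISOMAP-style.

    Neighboring points: the Euclidean (input-space) distance is used directly.
    Faraway points: the distance is the length of the shortest path through
    the k-nearest-neighbor graph (Dijkstra's algorithm).
    """
    graph = kneighbors_graph(points, k_neighbors, mode='distance')
    # Treat edges as undirected so the neighborhood graph stays symmetric.
    return shortest_path(graph, method='D', directed=False)
```

Classical multidimensional scaling applied to the resulting distance matrix then yields the low-dimensional embedding, whose coordinates serve as candidate basis functions for VFA.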

The rest of the paper is organized as follows. Section 2 gives a brief introduction to MDPs and related work on policy iteration and temporal difference learning. The manifold RL framework based on IFM is presented in Section 3, where the IFM-API algorithm, its parameter selection rules and its performance analysis are discussed. In Section 4, simulation and experimental results on three learning control problems with continuous state spaces are provided to illustrate the effectiveness of the proposed method, together with a robustness analysis. Section 5 draws conclusions and suggests future work.

Section snippets

Markov decision process

A typical Markov Decision Process (MDP) can be expressed as a four-tuple {S, A, R, P}, where S and A denote the state space and the action space, respectively, R: S × A → ℝ is the reward function (ℝ denotes the set of real numbers), and P denotes the state transition probability. The policy of an MDP, defined as π: S → Q(A), can be viewed as a mapping from S to A, where Q(A) denotes a probability distribution over A. The optimal policy π* can be estimated by the following objective
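To make the tuple {S, A, R, P} concrete, the following is a minimal tabular policy-evaluation sketch for a small discrete MDP; it only illustrates the quantities defined above and is not the objective or algorithm developed in the paper.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.95, tol=1e-8):
    """Evaluate the value function V^pi of a small discrete MDP {S, A, R, P}.

    P      : array of shape (|A|, |S|, |S|); P[a, s, s'] is the transition probability
    R      : array of shape (|S|, |A|);      R[s, a] is the expected reward
    policy : array of shape (|S|,) giving a deterministic action for each state
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman backup under the fixed policy pi.
        V_new = np.array([R[s, policy[s]] + gamma * P[policy[s], s] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```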

The manifold RL framework based on isometric feature mapping

In this section, we will present the manifold RL framework based on isometric feature mapping (IFM) as well as the IFM-API algorithm. Fig. 2 shows the overall flow chart, which consists of four main steps: sample collection, basis learning, policy evaluation and policy improvement. In the first step, samples are collected using a random policy or a known policy. The second step constructs the basis functions from the collected samples through the isometric feature mapping process. The
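The four steps above can be summarized in the following high-level sketch. All helper names (collect_samples, learn_ifm_basis, lstdq, greedy_policy) are hypothetical placeholders for the corresponding stages; the paper's actual routines and stopping criteria may differ.

```python
def ifm_api(collect_samples, learn_ifm_basis, lstdq, greedy_policy,
            num_iterations=20, gamma=0.95):
    """High-level sketch of the IFM-API loop (hypothetical helper names).

    Step 1, sample collection : gather transitions with a random/known policy.
    Step 2, basis learning    : build basis functions via isometric feature mapping.
    Step 3, policy evaluation : least-squares fit of Q on the learned basis.
    Step 4, policy improvement: act greedily with respect to the fitted Q.
    """
    samples = collect_samples()                      # exploratory data
    phi = learn_ifm_basis(samples)                   # IFM-based basis functions
    policy = greedy_policy(weights=None, phi=phi)    # arbitrary initial policy
    weights = None
    for _ in range(num_iterations):                  # approximate policy iteration
        weights = lstdq(samples, phi, policy, gamma) # evaluate the current policy
        policy = greedy_policy(weights, phi)         # improve the policy
    return policy, weights
```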

Simulation and experimental studies

In this section, the effectiveness and efficiency of the proposed IFM-API algorithm are evaluated in the following three control problems: the mountain car problem, the acrobot swing-up problem and a real physical inverted pendulum system. The three problems have been studied as benchmarks for RL algorithms in learning control of Markov Decision Processes with continuous state spaces and nonlinear dynamics. In the three benchmark problems, the performance of IFM-API is compared with popular API

Conclusion

Basis function selection has been regarded as a key problem in improving the performance of RL algorithms with function approximation. In this paper, a novel basis construction approach based on isometric feature mapping is proposed for value function approximation in RL. Through isometric feature mapping, basis functions are generated automatically with fewer adjustable parameters and lower computational cost. Simulation and experimental results illustrate that the

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61075072 and 91220301, and by the Program for New Century Excellent Talents in University under Grant NCET-10-0901.

References (38)

  • R.S. Sutton et al., Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning, Artif. Intell. (1999)
  • X. Xu et al., Reinforcement learning algorithms with function approximation: recent advances and applications, Inform. Sci. (2014)
  • G. Anderson et al., Multicore scheduling based on learning from optimization models, Int. J. Innovative Comput. Inform. Control (2013)
  • M. Balasubramanian et al., The isomap algorithm and topological stability, Science (2002)
  • P.L. Bartlett et al., Infinite-horizon policy-gradient estimation, J. Artif. Intell. Res. (2001)
  • A.G. Barto et al., Recent advances in hierarchical reinforcement learning, Discrete Event Dynam. Syst.: Theory Appl. (2003)
  • J. Bather, Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions (2000)
  • M. Belkin et al., Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. (2003)
  • D.P. Bertsekas et al., Neuro-Dynamic Programming (1996)
  • C.M. Bishop, Pattern Recognition and Machine Learning (2006)
  • L. Busoniu et al., Reinforcement Learning and Dynamic Programming Using Function Approximators (2010)
  • T.G. Dietterich, Hierarchical reinforcement learning with the MAXQ value function decomposition, J. Artif. Intell. Res. (2000)
  • T.G. Dietterich, State abstraction in MAXQ hierarchical reinforcement learning, in: Advances in Neural Information...
  • L. Feng et al., QoS optimization for web services composition based on reinforcement learning, Int. J. Innovative Comput. Inform. Control (2013)
  • Z. Huang, X. Xu, J. Wu, L. Zuo, Fuzzy c-means method for representation policy iteration in reinforcement learning, in:...
  • E. Kaya et al., Learning weights of fuzzy rules by using gravitational search algorithm, Int. J. Innovative Comput. Inform. Control (2013)
  • V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models (2001)
  • V. Kumar et al., Introduction to Parallel Computing: Design and Analysis of Algorithms (1994)
  • M.G. Lagoudakis et al., Least-squares policy iteration, J. Mach. Learn. Res. (2003)