Reinforcement learning based on local state feature learning and policy adjustment

https://doi.org/10.1016/S0020-0255(03)00006-9

Abstract

Extending reinforcement learning (RL) to large state spaces inevitably runs into the curse of dimensionality, so improving the learning efficiency of the agent is crucial for practical applications of RL. Consider learning to solve Markov decision problems optimally in a particular domain: if the domain has characteristics attributable to each state, the agent may be able to exploit these features to direct future learning. This paper first defines the local state feature and uses a state feature function to generate the local state features of a state. A weight function is then introduced to adjust the current policy toward actions worth exploring. Based on these ideas, an improved SARSA algorithm, Feature-SARSA, is proposed. We validate the new algorithm experimentally on a complex domain, Sokoban. The results show that the new algorithm performs better.

Introduction

In recent years there has been tremendous growth in reinforcement learning (RL) research, primarily in the machine learning subfield of artificial intelligence, but also in neural networks and artificial intelligence more broadly.

At the same time, RL techniques have found application in many fields [1], especially Web mining. To search the Web automatically, agent and mobile-agent technologies have been introduced [2], [3]. However, because of the explosive growth of Web resources on one hand and the relatively slow increase in network bandwidth on the other, the performance of traditional agents based on a breadth-first search strategy is unsatisfactory. Many new techniques have been proposed to address this problem, one of which is a focused search strategy based on RL. E-learning/e-training systems embedded with RL technology would also access Web resources more efficiently, so extending RL to e-learning/e-training is promising work.

RL [1] is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment in order to achieve a goal. The environment in most RL settings is modeled as a Markov decision process (MDP). There are three fundamental classes of methods for solving the RL problem: dynamic programming, Monte Carlo methods, and temporal-difference (TD) learning. Because TD learning combines Monte Carlo and dynamic programming ideas, much recent attention has focused on TD learning [4], [5], and a classical on-policy TD algorithm named SARSA (State Action Reward State Action) [6] is considered critical to the success of the RL enterprise [7]. However, extending RL to the real world remains difficult: one of RL's biggest stumbling blocks is the curse of dimensionality, in which costs increase exponentially with the number of state and action variables. For MDPs with large state spaces, the performance of traditional RL algorithms is far from efficient.

Considerable research has been directed toward this problem. State abstraction, or generalization, is a common method for reducing the search space in RL [8], [9]; it usually groups together states that share the same state feature values. In large real-world domains there may be an enormous number of decisions to make, and payoff may be sparse and delayed, so the method of macro-actions was proposed in [10], [11]: instead of learning all fine-grained actions individually, it learns a sequence of actions that is applicable as a unit in a set of states. The agent can thus learn much faster by abstracting away the myriad micro-decisions and focusing instead on a small set of important ones. Another strategy for dealing with large state spaces is to treat the problem as a hierarchy of learning problems [12], [13]; the main idea of the hierarchical method is to decompose a task into smaller subtasks. In many cases hierarchical solutions introduce slight sub-optimality in performance, but potentially gain a good deal of efficiency in execution time, learning time, and space.
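
For concreteness, here is a minimal sketch of the tabular SARSA algorithm referred to above. The environment interface (`reset`, `step`, `actions`) and parameter values are illustrative assumptions, not taken from the paper:

```python
import random
from collections import defaultdict

# Tabular action-value function; unvisited (state, action) pairs default to 0.
Q = defaultdict(float)

def epsilon_greedy(state, actions, epsilon=0.1):
    """Behaviour policy: explore with probability epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, alpha=0.1, gamma=0.95, epsilon=0.1):
    """One episode of tabular SARSA (on-policy TD control).
    `env` is assumed to expose reset(), step(a) -> (next_state, reward, done)
    and a list `actions`; these names are illustrative only."""
    s = env.reset()
    a = epsilon_greedy(s, env.actions, epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(s_next, env.actions, epsilon)
        # On-policy target: uses the action actually selected by the policy.
        target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next
```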

Moreover, in order to solve highly complex problems, biased exploration has been incorporated into RL, meaning that the RL algorithm is endowed with some built-in knowledge [14], [15], [16], [17], [18]. Singer and Veloso presented a method to bias exploration using the solutions of previous problems [15]. Mahadevan and Connell [16] decomposed the task into sets of simple sub-tasks, each with its own pre-wired applicability predicate. Matarić [17] minimized the state space by transforming state-action pairs into condition-behavior pairs and maximized learning by designing a reward-rich heterogeneous reinforcement. Millán [18] accelerated RL by integrating it with reflex rules that focus exploration where it is most needed.

However, little attention has been paid to research on state features. This paper argues that, in a particular domain, local state features can be viewed as a bias and presents a method of using them to direct future learning.

Consider learning to solve Markov decision problems optimally in a particular domain that contains a natural set of features attributable to each state; the agent might be able to take advantage of this information. Whereas a traditional RL agent simply treats every state as the same kind of entity and pays no attention to these local features, this paper focuses on how to use local state features to explore more effectively. Unlike the plain location-based state numbering commonly used by a reinforcement agent, which merely identifies each state uniquely, these state features have the potential to represent generalizations over states. This paper contributes an RL algorithm based on local feature learning and policy adjustment that learns a mapping from state features to actions worth exploring, thereby making learning more effective.
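
The precise form of the feature and weight functions is given later in the paper; purely to illustrate the idea of mapping state features to actions worth exploring, the sketch below biases action selection by learned feature weights. The table `W`, the helper `feature_fn`, and the multiplicative softmax bias are all our own assumptions, not the paper's definitions:

```python
import math
from collections import defaultdict

# Hypothetical weight table: (local feature, action) -> exploration weight.
W = defaultdict(lambda: 1.0)

def feature_fn(state):
    """Stand-in for a state feature function: returns the set of local
    features attributable to `state` (domain-specific in practice)."""
    return state.local_features  # assumed attribute, for illustration only

def adjusted_action_probs(state, actions, Q, tau=1.0):
    """Bias a softmax over Q-values by the learned feature weights, so that
    actions judged worth exploring for the state's features are favoured."""
    feats = feature_fn(state)
    scores = []
    for a in actions:
        bias = 1.0
        for f in feats:
            bias *= W[(f, a)]
        scores.append(bias * math.exp(Q[(state, a)] / tau))
    z = sum(scores)
    return [score / z for score in scores]
```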

The rest of the paper is organized as follows. Section 2 briefly introduces MDPs and the SARSA algorithm. Section 3 describes our proposed Feature-SARSA algorithm based on local state features and policy adjustment. Section 4 describes the experiments with the SARSA and Feature-SARSA algorithms and compares their performance. Finally, concluding remarks are given in Section 5.

Section snippets

Preliminaries

In this section, we describe the MDP and a classical on-policy TD learning algorithm, SARSA, so as to set up a necessary context for subsequent discussions.
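
For reference, the standard SARSA update the section builds on can be written as follows (the usual textbook form, stated here for context rather than quoted from the paper):

```latex
% An MDP is the tuple (S, A, P, R, \gamma). SARSA performs the on-policy update
\[
  Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],
\]
% where a_{t+1} is the action actually selected by the current policy in
% s_{t+1}, \alpha is the learning rate, and \gamma is the discount factor.
```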

Local state features

Many domains have numerous local state features that describe their states, and the agent can take advantage of them. But which local state features are valuable? RL usually involves numerous trials, each consisting of many steps before the agent reaches the goal. On each step the agent may receive an immediate reward. If, in some states that share common local state features, an action always results in an identical reward, then these local state features should be
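
The snippet cuts off before the formal criterion; the sketch below only illustrates one plausible reading, namely tracking whether an action yields an identical reward across states that share a feature. The bookkeeping structure and helper names are our own, not the paper's:

```python
from collections import defaultdict

# (feature, action) -> set of distinct rewards observed for that pair.
observed = defaultdict(set)

def record_transition(state_features, action, reward):
    """Record the reward obtained by `action` in a state with these features."""
    for f in state_features:
        observed[(f, action)].add(reward)

def is_consistent(feature, action):
    """Flag a feature-action pair when every occurrence so far produced the
    same reward (one interpretation of the truncated criterion above)."""
    rewards = observed[(feature, action)]
    return len(rewards) == 1
```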

Background of the experiment––the Sokoban domain

Sokoban is an interesting domain of puzzles that falls into the general class of motion planning problems with movable obstacles [22]. The object of the puzzle is for an agent in a grid world to move the balls so that each one is located on a destination grid cell. Each grid location is either open or blocked by a wall. The agent has at most four deterministic actions available: moving North, East, South, or West, but it may not move into a blocked location or a wall. Because the agent can only
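
As a concrete illustration of the domain's mechanics only, here is a toy grid and a legal-move check; the layout and helper names are our own and are not the paper's experimental setup:

```python
# '#' wall, '.' open, 'B' ball, 'G' goal cell, 'A' agent (illustrative layout).
GRID = [
    "#######",
    "#A.B.G#",
    "#######",
]

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def is_open(grid, r, c):
    """A cell is open if it lies inside the grid and is not a wall."""
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#"

def legal_agent_moves(grid, agent_pos, ball_positions):
    """Deterministic N/E/S/W moves: the agent may not enter a wall, and (in
    this simplified check) may push a ball only into an open, unoccupied cell."""
    legal = []
    r, c = agent_pos
    for name, (dr, dc) in MOVES.items():
        nr, nc = r + dr, c + dc
        if not is_open(grid, nr, nc):
            continue
        if (nr, nc) in ball_positions:
            br, bc = nr + dr, nc + dc
            if not is_open(grid, br, bc) or (br, bc) in ball_positions:
                continue
        legal.append(name)
    return legal
```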

Conclusions

In this paper, we have shown that local state features can be viewed as a bias for future exploration in specific domains. A feature function is defined to map states to their local state features, and a weight function is introduced to adjust the exploration strategy. Based on these, we proposed a principled approach to embedding local state features into the agent's exploration strategy and presented an improved SARSA algorithm named Feature-SARSA. We have implemented the

References (22)

  • S. Mahadevan et al.

    Automatic programming of behavior-based robots using reinforcement learning

    Artif. Intell.

    (1992)
  • R.S. Sutton et al.

    Reinforcement Learning: An Introduction

    (1998)
  • J. Cho, H. Garcia-molina, Efficient crawling through URL ordering, in: Proceedings of World Wide Web Conference (WWW7),...
  • J. Fiedler, J. Hammer, Using the Web efficiently: mobile crawlers, in: Proceedings of Seventeenth AoM/IAoM International...
  • R.S. Sutton, Temporal credit assignment in reinforcement learning, Ph.D. thesis, University of Massachusetts,...
  • R.S. Sutton

    Learning to predict by the method of temporal differences

    Mach. Learn.

    (1988)
  • R.S. Sutton, Generalization in reinforcement learning: successful examples using sparse coarse coding, in: D.S....
  • R.S. Sutton, Open theoretical questions in reinforcement learning, in: Proceedings of EuroCOLT’99, 1999, pp....
  • J.A. Boyan, A.W. Moore, Generalization in reinforcement learning: safely approximating the value function, in: G....
  • T. Dean, S.H. Lin, Decomposition techniques for planning in stochastic domains, in: Proceedings of the Fourteenth...
  • D. Precup, R.S. Sutton, S.P. Singh, Planning with closed-loop macro-actions, in: Working notes of the 1997 AAAI Fall...