
Pattern Recognition Letters

Volume 111, 1 August 2018, Pages 30-35

Discrete space reinforcement learning algorithm based on support vector machine classification

https://doi.org/10.1016/j.patrec.2018.04.012

Highlights

  • SVM classification is applied to the reinforcement learning algorithm.

  • The advantage actor-critic method is combined with SVM.

  • The result of SVM classification is used as the action selected by the Actor.

  • SVM optimization uses the advantage value to speed up convergence.

Abstract

When facing discrete-space learning problems, traditional reinforcement learning algorithms often suffer from slow convergence and poor convergence accuracy. Deep reinforcement learning requires a large number of learning samples, so it often struggles to converge and easily falls into local minima. In view of these problems, we apply support vector machine classification to reinforcement learning and propose an algorithm named Advantage Actor-Critic with Support Vector Machine Classification (SVM-A2C). Our algorithm adopts the actor-critic framework and uses the result of support vector machine classification as the Actor's action output, while the Critic uses the advantage function to improve and optimize the parameters of the support vector machine. In addition, since the environment in reinforcement learning changes continually, it is difficult to find a global optimal solution for the support vector machine; we therefore apply gradient descent to optimize its parameters, so that the agent can quickly learn a more precise action selection policy. Finally, the effectiveness of the proposed method is demonstrated in a classical reinforcement learning test environment: the proposed algorithm converges in fewer episodes and produces more accurate results than the compared algorithms.

Introduction

Statistical learning [1] is a theory that studies the rules of machine learning from small samples. It establishes a new theoretical system for small-sample statistical problems, in which inference rules not only consider the requirement of convergence but also pursue the best results obtainable under the condition of limited available information [2]. The support vector machine (SVM) [3], [4] is a machine learning method based on statistical learning theory and the structural risk minimization principle. Its learning strategy is 'maximum margin', that is, it solves for the separating hyperplane with the maximal margin. In effect, it transforms a classification problem into a convex quadratic programming problem (QPP). By introducing a kernel function, SVM can handle nonlinear classification problems: the kernel implicitly maps the training data into a higher-dimensional space, turning a nonlinear classification problem into a linear classification problem in that space. SVM has many unique advantages in solving small-sample, nonlinear and high-dimensional pattern recognition problems, and to a great extent it overcomes the problems of the 'curse of dimensionality', over-fitting and so on. Since SVM was proposed, it has attracted extensive attention because of its superior performance, and many researchers have devoted themselves to SVM and put forward improved algorithms. In 2015, Gu et al. proposed incremental support vector ordinal regression (ISVOR) based on a sum-of-margins strategy [5]. In 2016, Gu et al. put forward a robust regularization path algorithm for ν-support vector classification (ν-SvcRPath) to avoid exceptions and handle singularities in the key matrix [6]. At the same time, Ding et al. put forward a variety of improved algorithms for support vector machines [7], [8]. At present, SVM has been successfully applied to many fields, such as pattern recognition [9], text classification [10] and so on.
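To make the kernel and maximum-margin ideas above concrete, the short sketch below fits an RBF-kernel SVM to a toy nonlinear problem using scikit-learn; it is purely illustrative and is not the implementation used in this paper.

```python
# Minimal illustration of kernel-based SVM classification (scikit-learn);
# not the implementation used in the paper.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy nonlinear problem: points inside a circle are labeled +1, outside -1.
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 0.6, 1, -1)

# The RBF kernel implicitly maps the data into a higher-dimensional feature
# space in which a maximum-margin separating hyperplane can be found.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("number of support vectors:", clf.support_vectors_.shape[0])
```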

Reinforcement learning (RL) is an important research direction in the field of machine learning. A reinforcement learning agent learns the best policy mapping from environment states to actions through repeated interaction with the environment, so as to maximize a numerical reward signal. This is a closed-loop problem, because the actions chosen by the learning system influence its later inputs; moreover, the learner must discover which actions yield the most reward. As an important machine learning method, RL has been studied extensively. In 1989, Watkins [11] proposed a model-free off-policy reinforcement learning algorithm called Q-learning. Subsequently, Rummery and Niranjan proposed an on-policy reinforcement learning algorithm named SARSA, which modifies Q-learning by applying updates on-line during trials [12]. In recent years, reinforcement learning has achieved a series of important results [13], [14]. For example, in 2015, Mnih et al. proposed the Deep Q-Network (DQN), which learns strategies directly from high-dimensional raw input through end-to-end reinforcement learning training. In 2016, Silver et al. applied deep reinforcement learning to the game of Go and achieved a 99.8% winning rate against other Go programs. However, when facing small-scale discrete-space problems, traditional reinforcement learning algorithms often suffer from slow convergence and insufficient convergence accuracy, while deep reinforcement learning, whose learning process requires a large number of samples, often struggles to converge and easily falls into local minima.
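For reference, the tabular Q-learning update mentioned above has the familiar textbook form Q(s, a) ← Q(s, a) + α[r + γ·max_a' Q(s', a') − Q(s, a)]. A minimal sketch of this generic update (not code from this paper) is shown below.

```python
# Generic tabular Q-learning update with epsilon-greedy exploration
# (textbook sketch, not the authors' code).
import numpy as np

n_states, n_actions = 7, 2
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(state):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_update(s, a, r, s_next, done):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```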

In order to solve the above problems, namely that traditional reinforcement learning algorithms easily fall into local minima and converge slowly when facing discrete-space problems, we combine the advantage actor-critic method with support vector machine classification after analyzing the characteristics of support vector machines and reinforcement learning. We use the result of support vector machine classification as the Actor's action output, while the Critic uses the advantage function to improve the Actor's action selection. In addition, since the environment in reinforcement learning changes continually, making a global optimal solution for the support vector machine difficult to find, we use gradient descent to optimize the parameters of the support vector machine, so that the agent can quickly learn a more accurate action selection policy. Finally, the effectiveness of the proposed algorithm is verified by experiments.
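To make the role of the advantage explicit, a one-step advantage estimate of the kind commonly used in advantage actor-critic methods is

    A(s_t, a_t) ≈ δ_t = r_{t+1} + γ·V(s_{t+1}) − V(s_t),

where V is the critic's value estimate and γ is the discount factor; the critic is updated to reduce this error, while the actor (here, the support vector machine classifier) is adjusted toward actions with positive advantage. This is a generic textbook form; the exact estimator used in the paper may differ.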

The rest of this paper is organized as follows. Section 2 describes the basic concepts of support vector machines and reinforcement learning. Section 3 gives a detailed description of the new algorithm: we combine support vector machines with advantage actor-critic (A2C) and propose an algorithm named advantage actor-critic with support vector machine classification (SVM-A2C). Experimental results are given in Section 4. Finally, conclusions and future work appear in Section 5.

Section snippets

Support vector machines

The support vector machine (SVM) is a binary classification model; its mechanism is to find the optimal classification hyperplane that meets the classification requirements. SVM guarantees the classification accuracy of the hyperplane while maximizing the margin on either side of it [15], [16]. When kernel functions are applied, SVM can also be used for nonlinear classification [17].

Given a training dataset (x_i, y_i), i = 1, 2, …, l, with x_i ∈ ℝⁿ and y_i ∈ {+1, −1}, in feature space, the
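The snippet is truncated at this point. For reference, the standard soft-margin SVM primal problem over such a dataset is usually written as

    min_{w, b, ξ}  (1/2)‖w‖² + C·Σ_{i=1}^{l} ξ_i
    s.t.  y_i (wᵀφ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, …, l,

where φ is the (possibly nonlinear) feature map induced by the kernel and C > 0 trades off margin size against training errors. This is the standard formulation and is not necessarily identical to the one given in the full text.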

SVM-A2C algorithm

Based on the characteristics of support vector machines and reinforcement learning described in Section 2, we combine support vector machine classification with advantage actor-critic (A2C) [22] and propose a new algorithm named Advantage Actor-Critic with Support Vector Machine Classification (SVM-A2C).
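The full algorithm is not reproduced in this snippet. The following simplified sketch reflects our own reading of the structure described in the abstract: the actor's action is the argmax of SVM-style linear scorers, which are adjusted by gradient descent weighted by the advantage supplied by a tabular critic. The toy chain environment, the one-hot features, and the specific update rules are illustrative assumptions, not the authors' code.

```python
# Simplified SVM-A2C-style loop (our own illustrative reading, not the
# authors' implementation): the actor is a set of linear scorers with an
# L2 (margin-type) penalty standing in for the SVM, updated by gradient
# descent weighted by the critic's one-step advantage.
import numpy as np

n_states, n_actions = 7, 2            # small 1-D chain; actions: 0 = left, 1 = right
gamma, lr_actor, lr_critic, C = 0.95, 0.05, 0.1, 0.01
eps = 0.1                             # exploration rate

W = np.zeros((n_actions, n_states))   # one linear scorer per action
b = np.zeros(n_actions)
V = np.zeros(n_states)                # tabular critic

def features(s):                      # one-hot state features (illustrative)
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def select_action(s):
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(W @ features(s) + b))

for episode in range(500):
    s = n_states // 2                 # start in the middle of the chain
    for t in range(100):
        a = select_action(s)
        s_next = s + (1 if a == 1 else -1)
        done = s_next in (0, n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0

        # One-step advantage (TD error) from the critic, then critic update.
        target = r if done else r + gamma * V[s_next]
        adv = target - V[s]
        V[s] += lr_critic * adv

        # Gradient step on the chosen action's scorer, weighted by the
        # advantage, plus a small margin-type regularization term.
        x = features(s)
        W[a] += lr_actor * (adv * x - C * W[a])
        b[a] += lr_actor * adv

        if done:
            break
        s = s_next
```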

Experimental environment

In order to verify the effectiveness of the proposed algorithm, we use RandomWalk [20], a standard reinforcement learning test environment, to evaluate the performance of the algorithm. RandomWalk is a standard reinforcement learning problem in a discrete space. Its diagram is shown in Fig. 3.

In the RandomWalk problem, the initial state is the position of the middle point; the agent should explore within a limited number of steps to find a path that reaches the target point, while the reward value should
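The description is truncated here. For concreteness, a minimal RandomWalk-style environment of the kind described might look as follows; the exact reward scheme, number of states, and step limits used in the paper's experiments are assumptions in this sketch.

```python
# Minimal RandomWalk-style environment (illustrative assumptions; the exact
# reward scheme and step limits used in the paper are not reproduced here).
class RandomWalk:
    def __init__(self, n_states=7, max_steps=100):
        self.n_states = n_states          # states 0 .. n_states-1; 0 and n_states-1 terminal
        self.max_steps = max_steps

    def reset(self):
        self.pos = self.n_states // 2     # start from the middle state
        self.steps = 0
        return self.pos

    def step(self, action):               # action: 0 = move left, 1 = move right
        self.pos += 1 if action == 1 else -1
        self.steps += 1
        done = self.pos in (0, self.n_states - 1) or self.steps >= self.max_steps
        reward = 1.0 if self.pos == self.n_states - 1 else 0.0
        return self.pos, reward, done
```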

Conclusion and future work

To solve the problem that reinforcement learning algorithms in discrete space easily fall into local minima and converge slowly, this paper proposes a reinforcement learning algorithm based on support vector machine (SVM) classification decisions. Our algorithm adopts the actor-critic framework. The Actor selects an action according to the result of support vector machine classification, while the Critic adjusts its advantage function according to the feedback from the environment,

Acknowledgments

This work is supported by the Fundamental Research Funds for the Central Universities (No. 2017XKZD03).

References (25)

  • C.J.C.H. Watkins

    Learning from delayed rewards

    Rob. Auton. Syst.

    (1989)
  • G.A. Rummery et al.

    On-line Q-learning Using Connectionist Systems

    (1994)
    View full text