
1 Introduction

In reinforcement learning, off-policy methods have been receiving increasing attention. They break the dilemma of on-policy methods, in which the agent can only learn about the policy it is executing. In the off-policy setting, the agent is able to learn its target policy while executing a different behavior policy. Off-policy algorithms take mainly two forms, one based on the value function and the other based on the policy gradient.

The value-function approach has worked well in many applications. Q-learning, the prototype of DQN [9], is a classical value-function-based off-policy algorithm [16]. Different from on-policy value-function methods such as SARSA, Q-learning directly learns its optimal action-value function while executing an exploratory policy. However, Q-learning is only guaranteed to converge to the optimal policy in the tabular case and may diverge when using function approximation [2]. Its large overestimations of the action values may also lead it to perform poorly in many stochastic environments [1, 3, 5]. Using the off-policy per-decision importance sampling Monte-Carlo method [11] is another choice. However, using importance sampling to correct bias may produce large variance and therefore makes learning unstable. In recent years, the work of Harutyunyan et al. [4] showed that if the behavior policy \(\mu \) and target policy \(\pi \) are not too far apart, off-policy policy evaluation, without correcting for the “off-policyness” of a trajectory, still converges to the desired \(Q^\pi \). Using this conclusion, when \(\mu \) is similar to \(\pi \), we can directly treat off-policy methods as on-policy methods and need not use the importance sampling technique as before. However, the similarity between policies is difficult to control, which makes their method restrictive and impractical. Thus, importance sampling still seems inevitable. Munos et al. [10] proposed a new off-policy algorithm, Retrace(\(\lambda \)), which uses an importance sampling ratio truncated at 1 and can safely use samples collected from any behavior policy \(\mu \), regardless of \(\mu \)’s degree of “off-policyness”. However, one of its inherent disadvantages is that the importance sampling ratio forces it to select an explicit behavior policy \(\mu \) during training; as we all know, training performance is directly affected by the behavior policy, but choosing a reasonable behavior policy, especially for a complex agent, is often a difficult task.

From the perspective of the policy gradient, Degris et al. [5] propose the off-policy policy-gradient theorem and introduce the first off-policy actor-critic method, called Off-PAC. This method uses the actor-critic framework, in which the critic learns an off-policy estimator of the action-value function with the GTD(\(\lambda \)) algorithm, and this estimator is then used by the actor to update the policy with an incremental update algorithm using eligibility traces. However, facing the problem of biased estimation caused by the different sample distribution, the gradient of the objective function in Off-PAC also resorts to the importance sampling technique. Following the off-policy policy gradient theorem, Wang et al. propose a new off-policy actor-critic algorithm with experience replay, called ACER [15]. To make it stable, sample efficient, and perform remarkably well in challenging environments, ACER introduces many innovations, such as a truncated importance sampling technique, stochastic dueling network architectures, and a new trust region policy optimization method. However, like Retrace(\(\lambda \)), it also needs to choose a reasonable behavior policy, which is sometimes hard.

In summary, using the importance sampling technique to correct for the difference in sample distributions is a popular approach. At the same time, to avoid the variance explosion problem of ordinary importance sampling, many variants of importance sampling have been proposed, such as weighted importance sampling [8] and truncated importance sampling. It should be noted, however, that these variants often come at the cost of increasing the bias of the estimator. In addition, the key drawback of importance sampling is that the behavior policy should be known, Markov (purely a function of the current state), and represented as explicit action probabilities. For complex agents, none of these may be true [11].

Based on the discussion above, we try to improve the estimator from the perspective of both actor and critic so that our off-policy policy gradient estimator is theoretically unbiased without using the importance sampling technique. In detail, we use the all-action method [14] in the actor and exploit the tree-backup method in the critic to obtain an unbiased n-step return for estimating the action-value function. Meanwhile, inspired by the experience replay technique, in order to provide the tree-backup algorithm with enough low-correlation trajectory samples during learning, we propose episode-experience replay, which combines naive episode-experience replay with experience replay. Experimental results demonstrate the advantages of the proposed method over the compared methods.

2 Preliminaries and Notation

In this paper, we consider the episodic framework, in which the agent interacts with its environment in a sequence of episodes, numbered \(m=1,2,\ldots \), each of which consists of a finite number of time steps, \(t=0,1,2,\ldots ,T_{end}^m\). The first state of each episode, \(s_0 \in \mathcal {S}\), is chosen according to a fixed initial distribution \(p_0(s_0)\). We model the problem as a Markov decision process comprising: a state space \(\mathcal {S}\); a discrete action space \(\mathcal {A}\); a transition distribution \(\mathcal {P}:\mathcal {S}\times \mathcal {S}\times \mathcal {A}\rightarrow [0,1]\), where \(p(s'|s, a)\) is the probability of transitioning into state \(s'\) from state s after taking action a; and an expected reward function \(\mathcal {R}:\mathcal {S}\times \mathcal {A}\rightarrow \mathbb {R}\) that provides an expected reward \(r(s, a)\) for taking action a in state s. We assume here that \(\mathcal {A}\) is finite and the environment is completely characterized by the one-step state-transition probabilities, \(p(s'|s, a)\), and expected rewards, \(r(s, a)\), for all \(s, s' \in \mathcal {S}\) and \(a \in \mathcal {A}\).

The target policy of the agent is denoted \(\pi _\theta \), which maps states to a probability distribution over actions, \(\pi _\theta \): \(\mathcal {S}\rightarrow \mathcal {P}(\mathcal {A})\), where \(\theta \in \mathbb {R}^n\) is a vector of n parameters. The return from a state is defined as the sum of discounted future rewards, \(R_t=\sum _{i=t}^{T_{end}}\gamma ^{i-t}r(s_i, a_i)\), with a discount factor \(\gamma \in [0,1]\). Note that the return depends on the actions chosen, and therefore on the policy, and may be stochastic. We define the state-value function for \(\pi _\theta \) as \(V^{\pi _\theta }(s)=E_{\pi _\theta }(R_0|s_0=s)\) and the action-value function as \(Q^{\pi _\theta }(s, a)=E_{\pi _\theta }(R_0|s_0=s, a_0=a)\), both of which are expected total discounted rewards.
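As a small worked example (the reward values and \(\gamma \) below are arbitrary stand-ins, not taken from the paper), the return can be computed directly from its definition:

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_{i=t}^{T_end} gamma^(i-t) * r_i, evaluated here for t = 0."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Three-step episode with rewards 1, 0, 2:  R_0 = 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```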

The behavior policy is denoted \(b_\mu \): \(\mathcal {S}\rightarrow \mathcal {P}(\mathcal {A})\), where \(\mu \in \mathbb {R}^m\) is a vector of m parameters. We observe a stream of data, which includes states \(s_t \in \mathcal {S}\), actions \(a_t \in \mathcal {A}\), and rewards \(r_t \in \mathbb {R}\) for \(t=1,2,\ldots \), with actions selected from a distinct behavior policy, \(b_\mu (a|s) \in (0,1]\). Our aim is to choose \(\theta \) so as to maximize the following scalar objective function:

$$\begin{aligned} J(\theta )=\sum _{s \in \mathcal {S}}d^b(s)V^{\pi _\theta }(s) \end{aligned}$$
(1)

where \(d^b(s)=\lim _{t \rightarrow \infty }P(s_t=s| s_0, b)\) is the limiting distribution of states under b and \(P(s_t=s| s_0, b)\) is the probability that \(s_t=s\) when starting in \(s_0\) and executing b. The objective function is weighted by \(d^b\) because, in the off-policy setting, data is obtained according to the behavior distribution. For simplicity of notation, we will write \(\pi \) and implicitly mean \(\pi _\theta \).

3 Off-Policy Actor-Critic Combined with Tree-Backup

In the off-policy setting, using samples from the behavior policy's distribution, which differs from the sample distribution of the target policy, to calculate the naive policy gradient estimator may introduce bias and therefore change the solution the estimator converges to [12]. Many off-policy methods therefore use the importance sampling technique, a general technique for correcting this bias. However, as discussed above, considering the shortcomings of this technique, such as requiring the behavior policy to be explicitly represented as action probabilities and possibly causing the estimator to have large variance, we instead use the all-action method and the tree-backup algorithm so that the estimator can estimate the policy gradient without importance sampling.

3.1 Off-Policy Actor-Critic

The off-policy policy-gradient theorem proposed by Degris et al. [5] is:

$$\begin{aligned} g(\theta )=E_{s_t, a_t \thicksim b}[\rho (s_t,a_t)\psi (s_t,a_t)Q^{\pi }(s_t,a_t)] \end{aligned}$$
(2)

where \(\rho (s,a)=\frac{\pi (a|s)}{b(a|s)}\) is the ordinary importance sampling ratio and \(\psi (s,a)=\frac{\nabla _{\theta }\pi (a|s)}{\pi (a|s)}\) is the eligibility vector. Using the ordinary importance sampling ratio \(\rho (s,a)\) in the equation above yields an unbiased estimator of the actor-critic policy gradient under the target policy \(\pi \), given samples from the behavior policy b's distribution. However, when an unlikely event occurs, \(\rho (s,a)\) can be very large, leading the estimator to have high variance and instability. There are many techniques to reduce the variance of the estimator at the cost of introducing bias, such as weighted importance sampling, which performs a weighted average of the samples and thereby smooths the variance [8], or importance weight truncation, which directly uses a constant c to truncate the importance sampling ratio, i.e. \(\overline{\rho _{t}}=min\{c,\rho _{t}\}\) [15]. However, the key inherent drawback of the importance sampling technique is that the behavior policy should be known, Markov (purely a function of the current state), and represented as explicit action probabilities. For complex agents, none of these may be true [6].
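For illustration only (our method avoids importance weights altogether), the minimal sketch below contrasts the ordinary ratio \(\rho \) with the truncated ratio \(\overline{\rho }\); the probability values are hypothetical:

```python
def importance_weight(pi_prob, b_prob):
    """Ordinary per-decision importance sampling ratio rho = pi(a|s) / b(a|s)."""
    return pi_prob / b_prob

def truncated_weight(pi_prob, b_prob, c=1.0):
    """Truncated ratio rho_bar = min(c, rho), as in the truncation discussed above."""
    return min(c, pi_prob / b_prob)

# An action that is rare under b but likely under pi blows up the ordinary ratio ...
print(importance_weight(0.5, 0.01))  # 50.0 -> high-variance update
# ... while truncation caps it, trading variance for bias.
print(truncated_weight(0.5, 0.01))   # 1.0
```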

3.2 Off-Policy Actor-Critic Combined with All-Action and Tree-Backup

Given the disadvantages of importance sampling discussed above, we seek another way to eliminate the bias without using importance sampling.

First, we start from Eq. (2) and rewrite it as follows:

$$\begin{aligned} \begin{aligned} g(\theta )&=E_{s_t, a_t \thicksim b}[\rho (s_t,a_t)\psi (s_t,a_t)Q^{\pi }(s_t,a_t)]\\&=E_{s_t \thicksim d^{b}}[\sum _{a \in \mathcal {A}}b(a|s_{t})\frac{\pi (a|s_{t})}{b(a|s_{t})}\frac{\nabla _{\theta }\pi (a|s_{t})}{\pi (a|s_{t})}Q^{\pi }(s_{t},a)]\\&=E_{s_t \thicksim d^{b}}[\sum _{a \in \mathcal {A}}\nabla _{\theta }\pi (a|s_{t})Q^{\pi }(s_{t},a)]\\&=E_{s_t \thicksim d^b, a_t \thicksim b}[\nabla _\theta \pi (a_t|s_t)Q^\pi (s_t,a_t)+\sum _{a\in \mathcal {A},a\ne a_t}\nabla _\theta \pi (a|s_t)Q^\pi (s_t,a)] \end{aligned} \end{aligned}$$
(3)

Here we expand the expectation over actions into an explicit sum, which cancels the importance sampling ratio and removes the dependence on the random variable \(a_t\); this avoids introducing importance sampling in the actor. Algorithms of this form are called all-action methods because an update is made for every action available in each encountered state, irrespective of which action was actually taken [14].
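As a minimal sketch of this all-action term, assuming a linear softmax policy (a choice of ours, not a detail of the paper) and hypothetical features and critic values, the sum \(\sum _{a}\nabla _\theta \pi (a|s)Q(s,a)\) can be computed as:

```python
import numpy as np

def softmax_policy(theta, phi):
    """phi: |A| x d feature matrix for one state; returns pi(.|s) for a linear softmax policy."""
    logits = phi @ theta
    logits = logits - logits.max()        # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def all_action_term(theta, phi, q_values):
    """sum_a grad_theta pi(a|s) * Q(s, a) for a single state s."""
    pi = softmax_policy(theta, phi)
    baseline = pi @ phi                   # sum_b pi(b|s) * phi(s, b)
    # For a linear softmax policy: grad_theta pi(a|s) = pi(a|s) * (phi(s, a) - baseline)
    grad_pi = pi[:, None] * (phi - baseline[None, :])
    return grad_pi.T @ q_values           # d-dimensional contribution to the gradient

# Hypothetical example: 3 actions, 4 policy parameters
rng = np.random.default_rng(0)
theta, phi = rng.normal(size=4), rng.normal(size=(3, 4))
q_values = np.array([1.0, 0.5, -0.2])    # stand-in for the critic's Q(s, .)
print(all_action_term(theta, phi, q_values))
```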

Since the exact value of the action-value function \(Q^\pi (s, a)\) is unknown, the next step is to replace \(Q^\pi (s, a)\) with an estimator. Because we lack trajectory samples starting with \(s_t\) and an action \(a \ne a_t\), we can only estimate \(Q^\pi (s_t, a)\) directly with the action-value function approximation \(Q^\omega (s_t, a)\). For \(Q^\pi (s_t, a_t)\), however, considering the error reduction property of the n-step return [13] and the fact that we do have trajectory samples beginning with \(s_t\), \(a_t\), we choose to estimate it with an n-step return.

However, due to the difference in sample distributions, directly using the naive n-step return for this estimate introduces bias, so we need a technique to remove it. To avoid the disadvantages of importance sampling, we use the tree-backup algorithm to estimate \(Q^\pi (s_t, a_t)\). The tree-backup algorithm is designed precisely to estimate the action-value function in the off-policy setting. At each step along a trajectory, there are several possible actions under the target policy, and the one-step target combines the value estimates of these actions according to their probabilities of being taken under the target policy. At each step, the behavior policy takes one of the actions, and for that action, one time step later, there is a new estimate of its value, based on the reward received and the estimated value of the next state. The tree-backup algorithm then forms a new target, using the old value estimates for the actions that were not taken and the new estimated value for the action that was taken [11]. Iterating this process over n steps gives the n-step tree-backup estimator \(G_{t:t+n}\) of \(Q^{\pi }(s_{t},a_{t})\):

$$\begin{aligned} G_{t:t+n}=\gamma ^{n}Q(s_{t+n},a_{t+n})\prod _{i=t+1}^{t+n}\pi _{i}+\sum _{k=t+1}^{t+n}\gamma ^{k-(t+1)}\prod _{i=t+1}^{k-1}\pi _{i}[r_{k}+\gamma \sum _{a\ne a_{k}}\pi (a|s_{k})Q(s_{k},a)] \end{aligned}$$
(4)

where \(\pi _{i}\) is short for \(\pi (a_{i}|s_{i})\). To improve computational efficiency, we can use the recursive form below:

$$\begin{aligned} G_{t:t+n}=r_{t+1}+\gamma \sum _{a\ne a_{t+1}}\pi (a|s_{t+1})Q(s_{t+1},a)+\gamma \pi (a_{t+1}|s_{t+1})G_{t+1:t+n} \end{aligned}$$
(5)
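A minimal sketch of this recursion, evaluated backwards along a stored trajectory segment in a tabular setting with hypothetical policy probabilities and critic estimates (all names are ours), is:

```python
import numpy as np

def tree_backup_return(states, actions, rewards, pi_probs, q_values, gamma=0.99):
    """
    n-step tree-backup return G_{t:t+n} via the recursion in Eq. (5).

    states   : [s_t, ..., s_{t+n}]        (n+1 state indices)
    actions  : [a_t, ..., a_{t+n}]        (n+1 action indices)
    rewards  : [r_{t+1}, ..., r_{t+n}]    (n rewards)
    pi_probs : pi_probs[s] = target-policy probabilities pi(.|s)
    q_values : q_values[s] = current critic estimates Q(s, .)
    """
    n = len(rewards)
    # Bootstrap at the end of the segment: G_{t+n:t+n} = Q(s_{t+n}, a_{t+n})
    G = q_values[states[-1]][actions[-1]]
    for k in range(n - 1, -1, -1):
        s_next, a_next = states[k + 1], actions[k + 1]
        pi_next, q_next = pi_probs[s_next], q_values[s_next]
        # sum over a != a_{k+1} of pi(a|s_{k+1}) Q(s_{k+1}, a)
        backup = np.dot(pi_next, q_next) - pi_next[a_next] * q_next[a_next]
        G = rewards[k] + gamma * backup + gamma * pi_next[a_next] * G
    return G

# Hypothetical tabular example: 2 states, 2 actions, a 1-step segment
pi_probs = {0: np.array([0.7, 0.3]), 1: np.array([0.5, 0.5])}
q_values = {0: np.array([1.0, 0.0]), 1: np.array([0.2, 0.8])}
print(tree_backup_return([0, 1], [1, 0], [1.0], pi_probs, q_values, gamma=0.9))
```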

In summary, we replace \(Q^\pi (s_t,a)\) directly with the differentiable action-value function \(Q^\omega (s_t,a)\), and replace \(Q^\pi (s_t, a_t)\) with the n-step tree-backup estimator \(G_{t:t+n}\). We thus obtain the following estimator \(\hat{g}(\theta )\) of the off-policy actor-critic policy gradient \(g(\theta )\):

$$\begin{aligned} \begin{aligned} \hat{g}(\theta )=\frac{1}{N}\sum _{i=1}^{N}[\nabla _\theta \pi (a_i|s_i;\theta )G_{i: i+n}+\sum _{a \in \mathcal {A},a \ne a_i}\nabla _{\theta }\pi (a| s_i;\theta )Q^\omega (s_i, a)] \end{aligned} \end{aligned}$$
(6)

where \(G_{i: i+n}=r_{i+1}+\gamma \sum _{a \ne a_{i+1}}\pi (a| s_{i+1};\theta )Q^\omega (s_{i+1},a)+\gamma \pi (a_{i+1}| s_{i+1})G_{i+1:i+n}\).
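Assuming the tree-backup returns \(G_{i:i+n}\) have already been computed (e.g. as sketched above) and that \(\nabla _\theta \pi (\cdot |s)\) and \(Q^\omega (s,\cdot )\) are available as callables, a minimal sketch of the batch estimate in Eq. (6) is:

```python
import numpy as np

def policy_gradient_estimate(batch, grad_pi, q_values, returns):
    """
    Batch estimate of Eq. (6):
      g_hat = 1/N * sum_i [ grad pi(a_i|s_i) * G_{i:i+n}
                            + sum_{a != a_i} grad pi(a|s_i) * Q_omega(s_i, a) ]
    batch    : list of (s_i, a_i) pairs
    grad_pi  : grad_pi(s) -> |A| x d matrix whose rows are grad_theta pi(a|s)
    q_values : q_values(s) -> critic estimates Q_omega(s, .)
    returns  : returns[i] = n-step tree-backup return G_{i:i+n}
    """
    g_hat = None
    for i, (s, a) in enumerate(batch):
        gp, q = grad_pi(s), np.array(q_values(s), dtype=float)
        q_mixed = q.copy()
        q_mixed[a] = returns[i]            # the taken action uses the tree-backup return
        term = gp.T @ q_mixed              # sum_a grad pi(a|s) * target(a)
        g_hat = term if g_hat is None else g_hat + term
    return g_hat / len(batch)

# Toy usage: 2 actions, 3-dimensional theta, hypothetical stand-ins for the gradients and critic
grad_pi = lambda s: np.array([[0.1, 0.0, -0.1], [-0.1, 0.0, 0.1]])
q_vals  = lambda s: [0.5, 0.2]
print(policy_gradient_estimate([(0, 1), (1, 0)], grad_pi, q_vals, returns=[1.3, 0.7]))
```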

One key advantage of the tree-backup algorithm is that we no longer need to specify a behavior policy explicitly. As noted above, the choice of behavior policy directly affects an algorithm's performance, and choosing a reasonable behavior policy, especially for a complex agent, is often a difficult task.

It should be noted that the choice of n reflects a trade-off between bias and variance [15]. Typically, actor-only policy gradient estimators that use the Monte-Carlo return \(R_t\) as the critic, such as REINFORCE [17], have higher variance and lower bias, whereas actor-critic estimators that use function approximation as the critic have higher bias and lower variance. The larger n is, the more information \(G_{t: t+n}\) incorporates and therefore the smaller its bias, but the larger its variance. In the extreme case \(n=0\), \(G_{t: t+n}\) degenerates to the ordinary \(Q^\omega (s_t, a_t)\). In the experimental part of this paper, we illustrate this trade-off with a simple experiment.

3.3 Episode-Experience Replay

In practice, the experience obtained by trial and error usually comes at a high price, such as wear on equipment or consumption of time. If these experiences are used to adjust the networks only once and then thrown away, it is very wasteful [7]. The experience replay technique is a straightforward and effective way to reuse experience. DQN [9] uses experience replay, storing experiences from the Atari games, to obtain its excellent performance.

There is no doubt that the experience replay technique is very useful for one-step TD algorithms. However, the tree-backup algorithm and other n-step TD algorithms all need trajectory samples rather than single-experience samples. In previous work, one typically either executes a behavior policy and exploits eligibility traces with importance sampling to learn along the off-policy trajectory online, or uses a behavior policy to produce an off-policy episode and then extracts trajectory samples from this episode to learn offline. These sampling methods have some inherent drawbacks. Firstly, because the trajectory samples stem from the same episode, there are strong correlations between samples. Secondly, it is difficult to supply training with sufficient samples efficiently, as experience replay does. Last but not least, these sampling methods also need a behavior policy explicitly represented as action probabilities, which, as discussed above, is not easy to obtain.

In order to overcome these shortcomings, we propose the episode-experience replay technique, which combines naive episode-experience replay with experience replay. Unlike experience replay, which stores single experiences \((s_t, a_t, r_t, s_{t+1}) \thicksim b\), episode-experience replay works on episodes: it stores the complete episode \(s_0,a_0,r_1,s_1,a_1,r_2,s_2\ldots \thicksim b\) in the episode-experience pool and selects episodes from the pool randomly. Because the policy parameters are updated all the time, the episodes stored in the pool, which originate from the policy with old parameters, can be viewed as off-policy episodes. In the off-policy setting, this approach can efficiently provide the agent with enough off-policy episodes to learn from and greatly improves data efficiency and the speed at which we obtain off-policy samples. However, directly using consecutive samples \(\{(s_t, a_t, r_{t+1},s_{t+1})_{t=0:T_{end}^i-1}\}_{i=1\ldots M}\) from the selected episodes as a training batch, which we call naive episode-experience replay, is ineffective because of the strong correlations between samples. We therefore combine experience replay and episode-experience replay to obtain off-policy episode samples effectively while reducing correlations between samples. The process is shown in Fig. 1:

Fig. 1. The process of sampling in episode-experience replay.

In this process, we first select m episodes randomly from the episode-experience pool. From each episode we produce only one trajectory sample: we randomly choose one experience in the episode and use it as the starting point of the trajectory, and the end point of the trajectory is the last experience of the episode. Note that when we choose a starting experience in each episode, we should use a list to record its position in the corresponding episode so that we can quickly locate the trajectory's concrete position within the episode. At the same time, the m starting experiences can be used as training samples for the critic.
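A minimal sketch of this sampling scheme (class and method names are our own) could look as follows:

```python
import random

class EpisodeExperienceReplay:
    """
    Sketch of the sampling in Fig. 1: store whole episodes; each draw picks m
    episodes at random and, from each, one random starting index, yielding one
    trajectory (start -> episode end) plus the starting transition for the critic.
    """
    def __init__(self, capacity=1000):
        self.episodes = []
        self.capacity = capacity

    def store_episode(self, episode):
        """episode: list of transitions (s_t, a_t, r_{t+1}, s_{t+1})."""
        if len(self.episodes) >= self.capacity:
            self.episodes.pop(0)                    # drop the oldest episode
        self.episodes.append(episode)

    def sample(self, m):
        trajectories, critic_batch = [], []
        for episode in random.sample(self.episodes, m):
            start = random.randrange(len(episode))  # record the starting position
            trajectories.append(episode[start:])    # trajectory: start -> end of episode
            critic_batch.append(episode[start])     # starting transition trains the critic
        return trajectories, critic_batch
```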

3.4 Algorithm

Pseudocode for our method is shown in Algorithm 1. Here, in order not to introduce importance sampling, we use the deep Q-learning algorithm to train the critic network, and the training samples for the critic come from the starting experiences sampled in the episode-experience replay process. Because the network \(Q^\omega \) being updated is also used to calculate the target value, the target value changes drastically as \(Q^\omega \)'s parameters change, which may cause instability. Inspired by [6], we therefore use a target network \(Q^{\omega '}\) to calculate the target value and apply "soft" target updates rather than directly copying the weights.
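A minimal sketch of the "soft" target update \(\omega ' \leftarrow \tau \omega + (1-\tau )\omega '\), with \(\tau \) and the parameter lists as hypothetical stand-ins, is:

```python
def soft_update(target_params, online_params, tau=0.001):
    """'Soft' target update: omega' <- tau * omega + (1 - tau) * omega',
    applied element-wise to each pair of parameter values."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]

# Example with plain lists standing in for network weights
print(soft_update([1.0, 1.0], [0.0, 2.0], tau=0.1))  # [0.9, 1.1]
```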

Algorithm 1

4 Experiment

In this section, we examine the rationality and effectiveness of our algorithm on several simulation environments from OpenAI Gym. We designed our experiments to investigate the following questions:

1. What effect do different choices of n have on the algorithm's performance?

2. Compared with the original sampling method, how much performance improvement does episode-experience replay bring?

3. Compared with commonly used reinforcement learning methods, how much improvement does our algorithm bring?

4.1 Performance Comparison on Different Choices of n

As proposed in Sect. 3, the choice of n reflects a trade-off between variance and bias. Here, we use the CartPole-v0 simulation environment and choose \(n=1,3,6,9,15\) for training. Figure 2 shows the performance comparison for different n. We find that different choices of n have a direct effect on the algorithm's performance and that, in this environment, in terms of stability, convergence speed and performance, \(n = 9\) is relatively the best choice (here, we use naive episode-experience replay to highlight the effect).

Fig. 2. Comparison of different choices of n.

Fig. 3. Comparison of different sampling methods.

4.2 Performance Comparison on Different Sampling Methods

In this subsection, we compare the algorithm's performance under two different trajectory sampling methods during training: naive episode-experience replay and episode-experience replay. Due to the strong correlations between samples from naive episode-experience replay, as Fig. 3 shows, it exhibits large fluctuations in performance during training. We can clearly see that, in terms of both convergence speed and stability, episode-experience replay is more advantageous than naive episode-experience replay.

Fig. 4. Comparison with other methods.

4.3 Performance Comparison with Other Conventional Algorithms

We compare the algorithm's performance with DQN and the Actor-Critic algorithm on three simulation environments from OpenAI Gym: CartPole-v0, MountainCar-v0, and Acrobot-v0. As Fig. 4 shows, compared to DQN, the Actor-Critic policy gradient algorithm performs more stably during training. DQN exhibits great volatility, which may stem from an inherent drawback of value-function methods: the policy derived from the value function changes discontinuously. However, although the Actor-Critic algorithm is more stable than DQN, the final performance it converges to is worse than DQN's. Table 1 lists the average return of each algorithm; our method achieves the highest average return on all three simulation environments. In general, from the perspective of convergence speed and performance stability, our method performs better.

5 Conclusion

In this paper, we study the actor-critic policy gradient problem in the off-policy setting. Considering that the difference in sample distribution between the behavior policy and the target policy causes the estimator to be biased, and considering the limitations of the importance sampling technique, we use the all-action method and the tree-backup algorithm so that the estimator can use samples from the behavior policy directly to estimate the target policy gradient without bias. Here, the all-action method helps remove the random variable \(a_t\), so we can avoid importance sampling in the actor. The tree-backup method helps avoid importance sampling in the critic when using an n-step return as the estimator of the action-value function.

In addition, in order to improve efficiency, we propose the episode-experience replay technique, which combines naive episode-experience replay with experience replay to overcome the main disadvantages of previous sampling methods, such as strong correlations between samples, low throughput of trajectory samples, and the requirement of an explicitly represented behavior policy.

Experiments on the OpenAI Gym simulation platform demonstrate the advantages of the proposed method over the compared methods.